How to avoid tokenizing words with underscores?

I am trying to tokenize my text with the nltk.word_tokenize() function, but it splits words connected by "_".

For example, the text "A,_B_C! is a movie!" would be split into:

['a', ',', '_b_c', '!', 'is', 'a', 'movie', '!']

The result I want is:

['a,_b_c!', 'is', 'a', 'movie', '!']

My code:

import nltk

text = "A,_B_C! is a movie!"
nltk.word_tokenize(text.lower())

Any help would be appreciated!

Asked Oct 16 '22 by Sirui Li

1 Answer

You can first split the text on whitespace and then run word_tokenize on each word to handle punctuation, leaving any word that contains an underscore untouched:

from nltk.tokenize import word_tokenize

[word for sublist in [word_tokenize(x) if '_' not in x else [x]
                      for x in text.lower().split()] for word in sublist]

Output:

['a,_b_c!', 'is', 'a', 'movie', '!']

l = [word_tokenize(x) if '_' not in x else [x] for x in text.lower().split()] returns a list of lists, running word_tokenize only on the words that don't contain _.
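For the example text, that intermediate list looks like this (a quick check in a Python shell, with word_tokenize imported as above):

>>> text = "A,_B_C! is a movie!"
>>> l = [word_tokenize(x) if '_' not in x else [x] for x in text.lower().split()]
>>> l
[['a,_b_c!'], ['is'], ['a'], ['movie', '!']]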

The [word for sublist in l for word in sublist] part then flattens that list of lists into a single list.
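Continuing in the same shell:

>>> [word for sublist in l for word in sublist]
['a,_b_c!', 'is', 'a', 'movie', '!']

If you'd rather make a single tokenizer call, one alternative (not part of the original answer, so treat it as a sketch) is NLTK's RegexpTokenizer with a pattern whose first alternative keeps any underscore-containing chunk whole:

from nltk.tokenize import RegexpTokenizer

# match underscore-containing chunks first, then plain words, then punctuation
tokenizer = RegexpTokenizer(r"\S*_\S*|\w+|[^\w\s]")
tokenizer.tokenize("A,_B_C! is a movie!".lower())
# ['a,_b_c!', 'is', 'a', 'movie', '!']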

Answered Oct 19 '22 by mujjiga