How to avoid tokenizing words with underscores?

I am trying to tokenize my text with the nltk.word_tokenize() function, but it splits words connected by "_".

For example, the text "A,_B_C! is a movie!" would be split into:

['a', ',', '_b_c', '!', 'is', 'a', 'movie', '!']

The result I want is:

['a,_b_c!', 'is', 'a', 'movie', '!']

My code:

import nltk

text = "A,_B_C! is a movie!"
nltk.word_tokenize(text.lower())

Any help would be appreciated!

Asked Oct 16 '22 by Sirui Li

1 Answer

You can first split the text on whitespace and then run word_tokenize on each word to handle punctuation, leaving any word that contains an underscore untouched:

from nltk.tokenize import word_tokenize

[word for sublist in [word_tokenize(x) if '_' not in x else [x]
                      for x in text.lower().split()] for word in sublist]

Output:

['a,_b_c!', 'is', 'a', 'movie', '!']

l = [word_tokenize(x) if '_' not in x else [x] for x in text.lower().split()] returns a list of lists, running word_tokenize only on the words that don't contain _.
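For the example text, that intermediate list looks like this (a quick check in a Python shell, with word_tokenize imported as above):

>>> text = "A,_B_C! is a movie!"
>>> l = [word_tokenize(x) if '_' not in x else [x] for x in text.lower().split()]
>>> l
[['a,_b_c!'], ['is'], ['a'], ['movie', '!']]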

The [word for sublist in l for word in sublist] part then flattens that list of lists into a single list.
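Continuing in the same shell:

>>> [word for sublist in l for word in sublist]
['a,_b_c!', 'is', 'a', 'movie', '!']

If you'd rather make a single tokenizer call, one alternative (not part of the original answer, so treat it as a sketch) is NLTK's RegexpTokenizer with a pattern whose first alternative keeps any underscore-containing chunk whole:

from nltk.tokenize import RegexpTokenizer

# match underscore-containing chunks first, then plain words, then punctuation
tokenizer = RegexpTokenizer(r"\S*_\S*|\w+|[^\w\s]")
tokenizer.tokenize("A,_B_C! is a movie!".lower())
# ['a,_b_c!', 'is', 'a', 'movie', '!']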

Answered Oct 19 '22 by mujjiga