TypeError: expected string or bytes-like object – with Python/NLTK word_tokenize

Question

I have a dataset with ~40 columns, and am using .apply(word_tokenize) on 5 of them like so: df['token_column'] = df.column.apply(word_tokenize).

I'm getting a TypeError for only one of the columns, we'll call this problem_column

TypeError: expected string or bytes-like object

Here's the full error (stripped df and column names, and pii), I'm new to Python and am still trying to figure out which parts of the error messages are relevant:

TypeError                                 Traceback (most recent call last)
<ipython-input-51-22429aec3622> in <module>()
----> 1 df['token_column'] = df.problem_column.apply(word_tokenize)

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   2353             else:
   2354                 values = self.asobject
-> 2355                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2356 
   2357         if len(mapped) and isinstance(mapped[0], Series):

pandas\_libs\src\inference.pyx in pandas._libs.lib.map_infer (pandas\_libs\lib.c:66440)()

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages
ltk	okenize\__init__.py in word_tokenize(text, language, preserve_line)
    128     :type preserver_line: bool
    129     """
--> 130     sentences = [text] if preserve_line else sent_tokenize(text, language)
    131     return [token for sent in sentences
    132             for token in _treebank_word_tokenizer.tokenize(sent)]

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages
ltk	okenize\__init__.py in sent_tokenize(text, language)
     95     """
     96     tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
---> 97     return tokenizer.tokenize(text)
     98 
     99 # Standard word tokenizer.

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages
ltk	okenize\punkt.py in tokenize(self, text, realign_boundaries)
   1233         Given a text, returns a list of the sentences in that text.
   1234         """
-> 1235         return list(self.sentences_from_text(text, realign_boundaries))
   1236 
   1237     def debug_decisions(self, text):

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages
ltk	okenize\punkt.py in sentences_from_text(self, text, realign_boundaries)
   1281         follows the period.
   1282         """
-> 1283         return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
   1284 
   1285     def _slices_from_text(self, text):

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages
ltk	okenize\punkt.py in span_tokenize(self, text, realign_boundaries)
   1272         if realign_boundaries:
   1273             slices = self._realign_boundaries(text, slices)
-> 1274         return [(sl.start, sl.stop) for sl in slices]
   1275 
   1276     def sentences_from_text(self, text, realign_boundaries=True):

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages
ltk	okenize\punkt.py in <listcomp>(.0)
   1272         if realign_boundaries:
   1273             slices = self._realign_boundaries(text, slices)
-> 1274         return [(sl.start, sl.stop) for sl in slices]
   1275 
   1276     def sentences_from_text(self, text, realign_boundaries=True):

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages
ltk	okenize\punkt.py in _realign_boundaries(self, text, slices)
   1312         """
   1313         realign = 0
-> 1314         for sl1, sl2 in _pair_iter(slices):
   1315             sl1 = slice(sl1.start + realign, sl1.stop)
   1316             if not sl2:

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages
ltk	okenize\punkt.py in _pair_iter(it)
    310     """
    311     it = iter(it)
--> 312     prev = next(it)
    313     for el in it:
    314         yield (prev, el)

C:\Users\egagne\AppData\Local\Continuum\Anaconda3\lib\site-packages
ltk	okenize\punkt.py in _slices_from_text(self, text)
   1285     def _slices_from_text(self, text):
   1286         last_break = 0
-> 1287         for match in self._lang_vars.period_context_re().finditer(text):
   1288             context = match.group() + match.group('after_tok')
   1289             if self.text_contains_sentbreak(context):

TypeError: expected string or bytes-like object

The 5 columns are all character/string (as verified in SQL Server, SAS, and using .select_dtypes(include=[object])).

For good measure I used .to_string() to make sure problem_column is really and truly not anything besides a string, but I continue to get the error. If I process the columns separately good_column1-good_column4 continue to work and problem_column will still generate the error.

I've googled around and aside from stripping any numbers from the set (which I can't do, because those are meaningful) I haven't found any additional fixes.

Ekho · Accepted Answer

The problem is that you have None (NA) types in your DF. Try this:

df['label'].dropna(inplace=True)
tokens = df['label'].apply(word_tokenize)

Danish Shaikh · Answer

It might be showing an error because word_tokenize() only accept 1 string at a time. You can loop through the strings and then tokenize it.

For example:

text = "This is the first sentence. This is the second one. And this is the last one."
sentences = sent_tokenize(text)
words = [word_tokenize(sent) for sent in sentences]
print(words)

TypeError: expected string or bytes-like object – with Python/NLTK word_tokenize

Tags:

python

python-3.x

pandas

dataframe

nltk

LMGagne

2 Answers

Ekho

Danish Shaikh

Recent Activity

Donate For Us

TypeError: expected string or bytes-like object – with Python/NLTK word_tokenize

Tags:

python

python-3.x

pandas

dataframe

nltk

LMGagne

2 Answers

Ekho

Danish Shaikh

Related questions

Recent Activity

Donate For Us