Keras pad_sequences throwing invalid literal for int () with base 10

Question

Traceback (most recent call last):
    File ".\keras_test.py", line 62, in <module>
        X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
    File "C:\Program Files\Python36\lib\site-packages\keras\preprocessing\sequence.py", line 69, in pad_sequences
        trunc = np.asarray(trunc, dtype=dtype)
    File "C:\Program Files\Python36\lib\site-packages
umpy\core
umeric.py", line 531, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: invalid literal for int() with base 10: "plus 've added commercials experience tacky"

Hi there. I'm getting this error when trying to use the pad_sequence function of Keras. X_train is a sequence of strings, where "plus 've added commercials experience tacky" is the first of those strings.

Daniel Möller · Accepted Answer

The pad_sequence function has its default data type as 'int32':

keras.preprocessing.sequence.pad_sequences(sequences, maxlen=None, dtype='int32', 
                                           padding='pre', truncating='pre', value=0.)

The data you're passing is a string instead.

Adding to that, you can't use strings in a keras model.

You must "tokenize" those strings. Even if you may think it could pad strings, you must then decide what character it will pad with:

A space? But spaces may be meaningful characters
A Null character? The best idea, but how to increase the length of a string with null characters?
What if you're working with words instead of chars, where each token/id has a different string length?

That's why you must create a dictionary of integer id values representing each char or word in your existing data. And transform all your strings in lists of ids

Then you'd probably benefit from starting the model with an Embedding layer.

Example, if you're working with word ids:

Word 0: null word
Word 1: end of sentence
Word 2: space character (maybe not important to some languages)    
Word 3: a
Word 4: added
Word 5: am    
Word 6: and
....
Word 520: plus
Word 2014: 've
Word 
etc.....

Then your sentence would be a list with: [520, 2014, 4, ....]

Keras pad_sequences throwing invalid literal for int () with base 10

Tags:

python

python-3.x

numpy

tensorflow

keras

doofesohr

1 Answers

Daniel Möller

Recent Activity

Donate For Us

Keras pad_sequences throwing invalid literal for int () with base 10

Tags:

python

python-3.x

numpy

tensorflow

keras

doofesohr

1 Answers

Daniel Möller

Related questions

Recent Activity

Donate For Us