
Keras pad_sequences throwing "invalid literal for int() with base 10"

Traceback (most recent call last):
    File ".\keras_test.py", line 62, in <module>
        X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
    File "C:\Program Files\Python36\lib\site-packages\keras\preprocessing\sequence.py", line 69, in pad_sequences
        trunc = np.asarray(trunc, dtype=dtype)
    File "C:\Program Files\Python36\lib\site-packages\numpy\core\numeric.py", line 531, in asarray
        return array(a, dtype, copy=False, order=order)
ValueError: invalid literal for int() with base 10: "plus 've added commercials experience tacky"

Hi there. I'm getting this error when trying to use the pad_sequences function of Keras. X_train is a list of strings, and "plus 've added commercials experience tacky" is the first of those strings.

doofesohr asked Jan 30 '23 13:01


1 Answer

The pad_sequences function uses 'int32' as its default data type:

keras.preprocessing.sequence.pad_sequences(sequences, maxlen=None, dtype='int32', 
                                           padding='pre', truncating='pre', value=0.)

The data you're passing is a list of strings, which cannot be converted to that integer type.
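
As a rough illustration of where the message comes from: plain Python's int() rejects the same text with the same wording as the traceback. This is only a sketch of the failing conversion, not the exact code path inside NumPy; the variable names are made up for the example.

```python
# The first review from the question, still a raw string.
review = "plus 've added commercials experience tacky"

try:
    # An integer dtype requires every element to parse as an integer;
    # a sentence obviously does not.
    int(review)
except ValueError as err:
    message = str(err)
    print(message)
```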


Beyond that, a Keras model cannot take raw strings as input.

You must "tokenize" those strings first. Even if pad_sequences could pad strings, you would still have to decide which character to pad with:

  • A space? But spaces may be meaningful characters.
  • A null character? Probably the best idea, but how would you lengthen a string with null characters?
  • What if you're working with words instead of characters, where each token/id corresponds to a string of a different length?

That's why you must create a dictionary of integer ids, one for each character or word in your existing data, and transform all your strings into lists of ids.
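
A minimal sketch of that step, assuming whitespace splitting is good enough for your data. The helper names build_vocab and encode, the reserved tokens and the second sample sentence are all hypothetical, chosen only for this example.

```python
def build_vocab(texts, reserved=("<pad>", "<eos>")):
    """Map each distinct word to an integer id; the first ids are reserved."""
    vocab = {token: idx for idx, token in enumerate(reserved)}
    for text in texts:
        for word in text.split():
            if word not in vocab:
                vocab[word] = len(vocab)  # next free id
    return vocab


def encode(text, vocab):
    """Turn one string into a list of ids using the vocabulary."""
    return [vocab[word] for word in text.split()]


# The first string is from the question; the second is invented for illustration.
texts = ["plus 've added commercials experience tacky", "added experience plus"]
vocab = build_vocab(texts)
X_train = [encode(text, vocab) for text in texts]
print(X_train)
```

Lists of ids like these are what pad_sequences expects. Keras also has a Tokenizer class in keras.preprocessing.text that builds a comparable word index, which may save you from writing this by hand.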

Then you'd probably benefit from starting the model with an Embedding layer.


For example, if you're working with word ids:

Word 0: null word
Word 1: end of sentence
Word 2: space character (maybe not important to some languages)    
Word 3: a
Word 4: added
Word 5: am    
Word 6: and
....
Word 520: plus
Word 2014: 've
etc.

Your sentence would then become a list of ids: [520, 2014, 4, ...]
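
Once the data is a list of ids, padding becomes straightforward. The sketch below imitates the 'pre' padding and 'pre' truncating defaults in plain Python, just to show the idea. pad_ids is a hypothetical helper, not the Keras implementation, and it assumes maxlen is at least 1.

```python
def pad_ids(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) lists of ids to exactly maxlen entries."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen ids
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


# Only the first three ids of the example sentence are used here.
sentence = [520, 2014, 4]
print(pad_ids([sentence], maxlen=5))  # [[0, 0, 520, 2014, 4]]
```

Id 0 (the "null word") serves as the padding value, which is why it is reserved in the vocabulary.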

Daniel Möller answered Feb 02 '23 08:02
