I am working on text classification code, but I am having problems encoding documents with the tokenizer.
1) I started by fitting a tokenizer on my documents, like this:
vocabulary_size = 20000
tokenizer = Tokenizer(num_words= vocabulary_size, filters='')
tokenizer.fit_on_texts(df['data'])
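For context, this assumes the usual Keras preprocessing imports and that df['data'] is a column of raw text strings. A minimal, self-contained sketch of this setup (with a toy DataFrame standing in for my real data) would be:
import pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Toy stand-in for the real DataFrame of documents
df = pd.DataFrame({'data': ['physics is nice', 'chemistry is hard', 'biology is fun']})

vocabulary_size = 20000
tokenizer = Tokenizer(num_words=vocabulary_size, filters='')
tokenizer.fit_on_texts(df['data'])  # fit_on_texts expects an iterable of strings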
2) Then I wanted to check whether my data was fitted correctly, so I converted it into sequences, like this:
sequences = tokenizer.texts_to_sequences(df['data'])
data = pad_sequences(sequences, maxlen= num_words)
print(data)
This gave me the expected output, i.e. words encoded as numbers:
[[ 9628 1743 29 ... 161 52 250]
[14948 1 70 ... 31 108 78]
[ 2207 1071 155 ... 37607 37608 215]
...
[ 145 74 947 ... 1 76 21]
[ 95 11045 1244 ... 693 693 144]
[ 11 133 61 ... 87 57 24]]
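As a further sanity check (not part of my original code, just illustrative), the fitted vocabulary itself can be inspected through the tokenizer's word_index dictionary:
print(len(tokenizer.word_index))               # number of distinct words seen during fitting
print(list(tokenizer.word_index.items())[:5])  # a few (word, index) pairs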
Now I wanted to convert a single text into a sequence using the same method, like this:
sequences = tokenizer.texts_to_sequences("physics is nice ")
text = pad_sequences(sequences, maxlen=num_words)
print(text)
but it gave me weird output:
[[ 0 0 0 0 0 0 0 0 0 394]
[ 0 0 0 0 0 0 0 0 0 3136]
[ 0 0 0 0 0 0 0 0 0 1383]
[ 0 0 0 0 0 0 0 0 0 507]
[ 0 0 0 0 0 0 0 0 0 1]
[ 0 0 0 0 0 0 0 0 0 1261]
[ 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 1114]
[ 0 0 0 0 0 0 0 0 0 1]
[ 0 0 0 0 0 0 0 0 0 1261]
[ 0 0 0 0 0 0 0 0 0 753]]
According to the Keras documentation:
texts_to_sequences(texts)
Arguments: texts: list of texts to turn to sequences.
Return: list of sequences (one per text input).
Isn't it supposed to encode each word into its corresponding number, and then pad the text to length 50 if it is shorter than 50? Where is the mistake?
I guess you should call it like this:
sequences = tokenizer.texts_to_sequences(["physics is nice "])
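That is because texts_to_sequences expects a list of texts; if you pass a bare string, Python iterates over it character by character, so every character is treated as its own text and becomes its own row after padding. A quick way to see the difference (just a sketch against your fitted tokenizer):
per_char = tokenizer.texts_to_sequences("physics is nice ")    # one (possibly empty) sequence per character
per_text = tokenizer.texts_to_sequences(["physics is nice "])  # one sequence for the whole sentence
print(len(per_char), len(per_text))                            # many short sequences vs. a single sequence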
The other error is in how you pad the sequences. The value of maxlen should be the maximum number of tokens you want per sequence, e.g. 50. So change the lines to:
maxlen = 50
data = pad_sequences(sequences, maxlen=maxlen)
sequences = tokenizer.texts_to_sequences("physics is nice ")
text = pad_sequences(sequences, maxlen=maxlen)
This will cut the sequences to 50 tokens and fill shorter ones with zeros. Watch out for the padding option: the default is 'pre', which means that if a sentence is shorter than maxlen, the padded sequence starts with zeros to fill it up. If you want the zeros at the end of the sequence instead, pass the option padding='post' to pad_sequences.
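Putting both fixes together (the list-wrapped input and an explicit maxlen), a corrected sketch, with padding='post' included only to illustrate that option, would be:
maxlen = 50
sequences = tokenizer.texts_to_sequences(["physics is nice "])
text = pad_sequences(sequences, maxlen=maxlen, padding='post')
print(text.shape)  # (1, 50): one row for the one input text, word indices first, zeros appended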
You should try calling it like this:
sequences = tokenizer.texts_to_sequences(["physics is nice"])