
reverse word embeddings in keras - python

I am trying to make a chatbot in keras. I am assigning every word in the vocabulary its own ID. One training sample looks like this:

[0 0 0 0 0 0 32 328 2839 13 192 1 ] -> [23 3289 328 2318 12 0 0 0 0 0 0 0]

Then I am using the Embedding layer in Keras to embed these IDs into vectors of size 32. Then I'm using LSTM layers as the hidden layers. The problem is that my output is a list of embedded IDs, like so:

[ 0.16102183 0.1238187 0.1159694 0.13688719 0.12964118 0.12848872 0.13515817 0.13582146 0.16919741 0.15453722 ... ]

How can I convert these embeddings back to the words in my original vocabulary?

Here is my code:

from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from keras.models import Sequential, load_model
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence

import os

import numpy as np
import cPickle as pickle


class Chatbot(object):

    def __init__(self, h_layers=1):
        # self.name = name
        self.h_layers = h_layers
        self.seq2seq = None
        self.max_length = 0
        self.vocabulary = {}

    @staticmethod
    def load(model_name):
        with open('models/{}/chatbot_object.pkl'.format(model_name), 'rb') as pickle_file:
            obj = pickle.load(pickle_file)
        obj.seq2seq = load_model('models/{}/seq2seq.h5'.format(model_name))
        return obj

    def train(self, x_train, y_train):
        count_vect = CountVectorizer()
        count_vect.fit(x_train)
        count_vect.fit(y_train)

        self.vocabulary = count_vect.vocabulary_
        self.vocabulary.update({'<START>': len(self.vocabulary),
                                '<END>': len(self.vocabulary) + 1,
                                '<PAD>': len(self.vocabulary) + 2,
                                '<UNK>': len(self.vocabulary) + 3})

        for i in range(len(x_train)):
            x_train[i] = ['<START>'] + [w.lower() for w in word_tokenize(x_train[i])] + ['<END>']
        for i in range(len(y_train)):
            y_train[i] = ['<START>'] + [w.lower() for w in word_tokenize(y_train[i])] + ['<END>']

        for sample in x_train:
            if len(sample) > self.max_length:
                self.max_length = len(sample)
        for sample in y_train:
            if len(sample) > self.max_length:
                self.max_length = len(sample)

        for i in range(len(x_train)):
            x_train[i] = [self.vocabulary[w] for w in x_train[i] if w in self.vocabulary]
        for i in range(len(y_train)):
            y_train[i] = [self.vocabulary[w] for w in y_train[i] if w in self.vocabulary]

        x_train = sequence.pad_sequences(x_train, maxlen=self.max_length, value=self.vocabulary['<PAD>'])
        y_train = sequence.pad_sequences(y_train, maxlen=self.max_length, padding='post',
                                         value=self.vocabulary['<PAD>'])

        x_train = np.asarray(x_train)
        y_train = np.asarray(y_train)

        embedding_vector_length = 32

        self.seq2seq = Sequential()
        self.seq2seq.add(Embedding(len(self.vocabulary), embedding_vector_length, input_length=self.max_length))

        for _ in range(self.h_layers):
            self.seq2seq.add(LSTM(self.max_length, return_sequences=True))

        self.seq2seq.add(LSTM(self.max_length))
        self.seq2seq.compile(loss='cosine_proximity', optimizer='adam', metrics=['accuracy'])
        self.seq2seq.fit(x_train[:100], y_train[:100], epochs=5, batch_size=32)

    def save(self, filename):
        if filename not in os.listdir('models'):
            os.system('mkdir models/{}'.format(filename))
        self.seq2seq.save('models/{}/seq2seq.h5'.format(filename))
        self.seq2seq = None
        with open('models/{}/chatbot_object.pkl'.format(filename), 'wb') as pickle_file:
            pickle.dump(self, pickle_file)

    def respond(self, text):
        tokens = ['<START>'] + [w.lower() for w in word_tokenize(text)] + ['<END>']
        for i in range(len(tokens)):
            if tokens[i] in self.vocabulary:
                tokens[i] = self.vocabulary[tokens[i]]
            else:
                tokens[i] = self.vocabulary['<PAD>']
        x = sequence.pad_sequences([tokens], maxlen=self.max_length, value=self.vocabulary['<PAD>'])
        prediction = self.seq2seq.predict(x, batch_size=1)
        return prediction[0]
asked Aug 19 '17 by Noah Chalifour


2 Answers

Embedding layers work like Dense layers without a bias or activation, just optimized for the lookup. The input is conceptually a one-hot vector (really it is an integer index, but you can think of it as being converted to a one-hot vector first), and the output is a dense vector. Because of this, embedding_layer.weights[0] returns the matrix that the one-hot vector would be multiplied by. That means tf.linalg.pinv(embedding_layer.weights[0]) gives you a matrix that, when multiplied by an embedded vector, approximately recovers the one-hot vector (tf.linalg.pinv computes the Moore-Penrose pseudo-inverse). The reverse of the embedding is therefore tf.linalg.matmul(embedded_vector, tf.linalg.pinv(embedding_layer.weights[0])), which yields a vector of vocabulary length. You would then put that through a softmax (tf.nn.softmax) to get a probability distribution over the words.
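
A minimal sketch of that reversal, assuming your trained model's first layer is the Embedding layer; model and embedded_vector below are placeholders for your own objects, not names from the question:

import tensorflow as tf

embedding_layer = model.layers[0]                  # the Embedding layer of the trained model
weights = embedding_layer.weights[0]               # shape (vocab_size, embedding_dim)
weights_pinv = tf.linalg.pinv(weights)             # shape (embedding_dim, vocab_size)

# embedded_vector has shape (..., embedding_dim), e.g. one row per timestep
logits = tf.linalg.matmul(embedded_vector, weights_pinv)   # shape (..., vocab_size)
probs = tf.nn.softmax(logits)                      # probability distribution over the vocabulary
word_ids = tf.argmax(probs, axis=-1)               # most likely token ID at each position

From word_ids you can map back to strings with whatever ID-to-word dictionary you built when encoding the data.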

answered Oct 17 '22 by John Smith


I could not find the answer to this either, so I wrote a lookup function.

def lookup(tokenizer, vec, returnIntNotWord=True):
    # word_index sorted by index, so position i in twordkey is the word with index i + 1
    twordkey = [(k, tokenizer.word_index[k]) for k in
                sorted(tokenizer.word_index, key=tokenizer.word_index.get, reverse=False)]
    oneHotVec = []  # captures the indices of the words
    engVec = []     # holds the (word, index) pairs; returned when returnIntNotWord is False
    for eachRow, notUsed in enumerate(vec):
        for index, item in enumerate(vec[0]):
            if vec[eachRow][index] == 1:
                oneHotVec.append(index)
    for index in oneHotVec:
        engVec.append(twordkey[index])
    if returnIntNotWord:
        return oneHotVec
    else:
        return engVec

Here, tokenizer is the Keras Tokenizer, vec is a 2D array of one-hot encoded labels, and returnIntNotWord controls whether the function returns the integer indices or the (word, index) pairs, as noted in the comments.
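
For illustration, a hypothetical call might look like this (the toy texts and the hand-built one-hot rows are assumptions, not part of the answer; note that Keras starts word_index at 1, so make sure the columns of your one-hot labels line up with the positions in twordkey):

from keras.preprocessing.text import Tokenizer
import numpy as np

texts = ['hello how are you', 'i am fine thanks']   # made-up example data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

# one row per token, with a 1 in the column of that token's position in twordkey
vec = np.zeros((2, len(tokenizer.word_index)))
vec[0][0] = 1    # first entry of twordkey
vec[1][3] = 1    # fourth entry of twordkey

print(lookup(tokenizer, vec, returnIntNotWord=True))    # [0, 3]
print(lookup(tokenizer, vec, returnIntNotWord=False))   # the matching (word, index) pairs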

answered Oct 17 '22 by Definity