
reverse word embeddings in keras - python

I am trying to make a chatbot in keras. I am assigning every word in the vocabulary its own ID. One training sample looks like this:

[0 0 0 0 0 0 32 328 2839 13 192 1 ] -> [23 3289 328 2318 12 0 0 0 0 0 0 0]

Then I am using the Embedding layer in Keras to embed these IDs into vectors of size 32. Then I'm using LSTM layers as the hidden layers. The problem is that my output is a list of embedded IDs, like so:

[ 0.16102183 0.1238187 0.1159694 0.13688719 0.12964118 0.12848872 0.13515817 0.13582146 0.16919741 0.15453722 ... ]

How can I convert these embeddings back to the words in my original vocabulary?

Here is my code:

from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from keras.models import Sequential, load_model
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence

import os

import numpy as np
import cPickle as pickle


class Chatbot(object):

    def __init__(self, h_layers=1):
        # self.name = name
        self.h_layers = h_layers
        self.seq2seq = None
        self.max_length = 0
        self.vocabulary = {}

    @staticmethod
    def load(model_name):
        with open('models/{}/chatbot_object.pkl'.format(model_name), 'rb') as pickle_file:
            obj = pickle.load(pickle_file)
        obj.seq2seq = load_model('models/{}/seq2seq.h5'.format(model_name))
        return obj

    def train(self, x_train, y_train):
        count_vect = CountVectorizer()
        count_vect.fit(x_train)
        count_vect.fit(y_train)

        self.vocabulary = count_vect.vocabulary_
        self.vocabulary.update({'<START>': len(self.vocabulary),
                                '<END>': len(self.vocabulary) + 1,
                                '<PAD>': len(self.vocabulary) + 2,
                                '<UNK>': len(self.vocabulary) + 3})

        for i in range(len(x_train)):
            x_train[i] = ['<START>'] + [w.lower() for w in word_tokenize(x_train[i])] + ['<END>']
        for i in range(len(y_train)):
            y_train[i] = ['<START>'] + [w.lower() for w in word_tokenize(y_train[i])] + ['<END>']

        for sample in x_train:
            if len(sample) > self.max_length:
                self.max_length = len(sample)
        for sample in y_train:
            if len(sample) > self.max_length:
                self.max_length = len(sample)

        for i in range(len(x_train)):
            x_train[i] = [self.vocabulary[w] for w in x_train[i] if w in self.vocabulary]
        for i in range(len(y_train)):
            y_train[i] = [self.vocabulary[w] for w in y_train[i] if w in self.vocabulary]

        x_train = sequence.pad_sequences(x_train, maxlen=self.max_length, value=self.vocabulary['<PAD>'])
        y_train = sequence.pad_sequences(y_train, maxlen=self.max_length, padding='post',
                                         value=self.vocabulary['<PAD>'])

        x_train = np.asarray(x_train)
        y_train = np.asarray(y_train)

        embedding_vector_length = 32

        self.seq2seq = Sequential()
        self.seq2seq.add(Embedding(len(self.vocabulary), embedding_vector_length, input_length=self.max_length))

        for _ in range(self.h_layers):
            self.seq2seq.add(LSTM(self.max_length, return_sequences=True))

        self.seq2seq.add(LSTM(self.max_length))
        self.seq2seq.compile(loss='cosine_proximity', optimizer='adam', metrics=['accuracy'])
        self.seq2seq.fit(x_train[:100], y_train[:100], epochs=5, batch_size=32)

    def save(self, filename):
        if filename not in os.listdir('models'):
            os.system('mkdir models/{}'.format(filename))
        self.seq2seq.save('models/{}/seq2seq.h5'.format(filename))
        self.seq2seq = None
        with open('models/{}/chatbot_object.pkl'.format(filename), 'wb') as pickle_file:
            pickle.dump(self, pickle_file)

    def respond(self, text):
        tokens = ['<START>'] + [w.lower() for w in word_tokenize(text)] + ['<END>']
        for i in range(len(tokens)):
            if tokens[i] in self.vocabulary:
                tokens[i] = self.vocabulary[tokens[i]]
            else:
                tokens[i] = self.vocabulary['<PAD>']
        x = sequence.pad_sequences([tokens], maxlen=self.max_length, value=self.vocabulary['<PAD>'])
        prediction = self.seq2seq.predict(x, batch_size=1)
        return prediction[0]
asked Aug 19 '17 by Noah Chalifour


2 Answers

Embedding layers work like Dense layers without a bias or activation, just optimized for the lookup. The input is conceptually a one-hot vector (really it is an integer index, but you can think of it as being converted to a one-hot vector first), and the output is a dense vector. Because of this, embedding_layer.weights[0] returns the matrix that the one-hot vector would be multiplied by. That means tf.linalg.pinv(embedding_layer.weights[0]) gives you a matrix that, when multiplied by an embedded vector, approximately recovers the one-hot vector (tf.linalg.pinv computes the Moore-Penrose pseudo-inverse). The reverse of the embedding is therefore tf.linalg.matmul(embedded_vector, tf.linalg.pinv(embedding_layer.weights[0])), which yields a vector of vocabulary length. You would then put that through a softmax (tf.nn.softmax) to get a probability distribution over the words.
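
A minimal sketch of that reversal, assuming your trained model's first layer is the Embedding layer; model and embedded_vector below are placeholders for your own objects, not names from the question:

import tensorflow as tf

embedding_layer = model.layers[0]                  # the Embedding layer of the trained model
weights = embedding_layer.weights[0]               # shape (vocab_size, embedding_dim)
weights_pinv = tf.linalg.pinv(weights)             # shape (embedding_dim, vocab_size)

# embedded_vector has shape (..., embedding_dim), e.g. one row per timestep
logits = tf.linalg.matmul(embedded_vector, weights_pinv)   # shape (..., vocab_size)
probs = tf.nn.softmax(logits)                      # probability distribution over the vocabulary
word_ids = tf.argmax(probs, axis=-1)               # most likely token ID at each position

From word_ids you can map back to strings with whatever ID-to-word dictionary you built when encoding the data.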

answered Oct 17 '22 by John Smith


I could not find the answer to this either, so I wrote a lookup function.

def lookup(tokenizer, vec, returnIntNotWord=True):
    # word_index sorted by index, so position i in twordkey is the word with index i + 1
    twordkey = [(k, tokenizer.word_index[k]) for k in
                sorted(tokenizer.word_index, key=tokenizer.word_index.get, reverse=False)]
    oneHotVec = []  # captures the indices of the words
    engVec = []     # holds the (word, index) pairs; returned when returnIntNotWord is False
    for eachRow, notUsed in enumerate(vec):
        for index, item in enumerate(vec[0]):
            if vec[eachRow][index] == 1:
                oneHotVec.append(index)
    for index in oneHotVec:
        engVec.append(twordkey[index])
    if returnIntNotWord:
        return oneHotVec
    else:
        return engVec

Here, tokenizer is the Keras Tokenizer, vec is a 2D array of one-hot encoded labels, and returnIntNotWord controls whether the function returns the integer indices or the (word, index) pairs, as noted in the comments.
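
For illustration, a hypothetical call might look like this (the toy texts and the hand-built one-hot rows are assumptions, not part of the answer; note that Keras starts word_index at 1, so make sure the columns of your one-hot labels line up with the positions in twordkey):

from keras.preprocessing.text import Tokenizer
import numpy as np

texts = ['hello how are you', 'i am fine thanks']   # made-up example data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

# one row per token, with a 1 in the column of that token's position in twordkey
vec = np.zeros((2, len(tokenizer.word_index)))
vec[0][0] = 1    # first entry of twordkey
vec[1][3] = 1    # fourth entry of twordkey

print(lookup(tokenizer, vec, returnIntNotWord=True))    # [0, 3]
print(lookup(tokenizer, vec, returnIntNotWord=False))   # the matching (word, index) pairs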

answered Oct 17 '22 by Definity