
Using word2vec to classify words in categories

BACKGROUND

I have vectors of sample words, and each vector has a category name (Places, Colors, Names).

['john','jay','dan','nathan','bob']  -> 'Names'
['yellow', 'red','green'] -> 'Colors'
['tokyo','bejing','washington','mumbai'] -> 'Places'

My objective is to train a model that takes a new input string and predicts which category it belongs to. For example, if the new input is "purple", it should predict 'Colors' as the correct category. If the new input is "Calgary", it should predict 'Places'.

APPROACH

I did some research and came across Word2vec. This library has "similarity" and "most_similar" functions which I can use. One brute-force approach I thought of is the following:

  1. Take new input.
  2. Calculate its similarity with each word in each vector and take an average.

So, for instance, for the input "pink" I can calculate its similarity with each word in the vector "Names", take an average, and then do the same for the other two vectors. The vector that gives the highest average similarity would be the category the input belongs to.
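The averaging idea can be sketched as follows. This is only an illustration under stated assumptions: the 3-dimensional vectors are made-up stand-ins for real word2vec embeddings, and `cosine` and `classify` are hypothetical helper names, not functions from any library.

```python
import math

# Toy 3-d vectors standing in for real word embeddings (made-up values).
toy_vectors = {
    "john":   [0.9, 0.1, 0.0],
    "bob":    [0.8, 0.2, 0.1],
    "red":    [0.1, 0.9, 0.1],
    "green":  [0.0, 0.8, 0.2],
    "tokyo":  [0.1, 0.1, 0.9],
    "mumbai": [0.2, 0.0, 0.8],
    "pink":   [0.1, 0.85, 0.15],
}

# Category -> sample words (shortened version of the lists above).
categories = {
    "Names": ["john", "bob"],
    "Colors": ["red", "green"],
    "Places": ["tokyo", "mumbai"],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(word, vectors, categories):
    """Return (best_category, average similarity per category)."""
    query = vectors[word]
    averages = {}
    for name, members in categories.items():
        sims = [cosine(query, vectors[m]) for m in members]
        averages[name] = sum(sims) / len(sims)
    # The category with the highest average similarity wins.
    best = max(averages, key=averages.get)
    return best, averages


best, averages = classify("pink", toy_vectors, categories)
print(best, averages)
```

With real embeddings, the lookup into `toy_vectors` would be replaced by a lookup into the trained model, and words missing from the vocabulary would need separate handling.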

ISSUE

Given my limited knowledge of NLP and machine learning, I am not sure that this is the best approach, so I am looking for help and suggestions on better ways to solve my problem. I am open to all suggestions; please also point out any mistakes I may have made, as I am new to machine learning and NLP.

Asked by Dinero, Dec 06 '17 04:12



2 Answers

If you're looking for the simplest / fastest solution, then I'd suggest you take pre-trained word embeddings (Word2Vec or GloVe) and just build a simple query system on top of them. The vectors have been trained on a huge corpus and are likely to contain a good enough approximation of your domain data.

Here's my solution:

import numpy as np

# Category -> words
data = {
    'Names': ['john', 'jay', 'dan', 'nathan', 'bob'],
    'Colors': ['yellow', 'red', 'green'],
    'Places': ['tokyo', 'bejing', 'washington', 'mumbai'],
}
# Word -> category
categories = {word: key for key, words in data.items() for word in words}

# Load the whole embedding matrix (each line: a word followed by its vector)
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        embed = np.array(values[1:], dtype=np.float32)
        embeddings_index[word] = embed
print('Loaded %s word vectors.' % len(embeddings_index))
# Embeddings for the sample words that are present in GloVe
data_embeddings = {key: value for key, value in embeddings_index.items() if key in categories}

# Processing the query: average dot-product similarity per category
def process(query):
    # glove.6B is uncased, so lowercase the query; unknown words raise KeyError
    query_embed = embeddings_index[query.lower()]
    scores = {}
    for word, embed in data_embeddings.items():
        category = categories[word]
        # Dot product is a similarity (higher means closer), not a distance
        sim = query_embed.dot(embed)
        sim /= len(data[category])
        scores[category] = scores.get(category, 0) + sim
    return scores

# Testing
print(process('pink'))
print(process('frank'))
print(process('moscow'))

In order to run it, you'll have to download and unpack the pre-trained GloVe data (the code expects the file glove.6B.100d.txt; careful, the download is about 800 MB!). Upon running, it should produce something like this:

{'Colors': 24.655489603678387, 'Names': 5.058711671829224, 'Places': 0.90213905274868011}
{'Colors': 6.8597321510314941, 'Names': 15.570847320556641, 'Places': 3.5302454829216003}
{'Colors': 8.2919375101725254, 'Names': 4.58830726146698, 'Places': 14.7840416431427}

... which looks pretty reasonable: in each case the highest score belongs to the expected category. And that's it! If you don't need such a big model, you can filter the words in GloVe according to their tf-idf score. Remember that the model size depends only on the data you have and the words you might want to be able to query.
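To make the filtering step concrete, here is a rough sketch that keeps only the embedding lines for a chosen set of words. It assumes the GloVe text format used above (a word followed by space-separated numbers on each line). How the set of words to keep is chosen (by tf-idf or otherwise) is left out, and `filter_embeddings` is a hypothetical helper name, not part of GloVe or NumPy.

```python
import io


def filter_embeddings(lines, keep_words):
    """Yield only the embedding lines whose first token is in keep_words."""
    for line in lines:
        word = line.split(" ", 1)[0]
        if word in keep_words:
            yield line


# Tiny in-memory stand-in for a GloVe text file (made-up values).
sample = io.StringIO(
    "pink 0.1 0.2 0.3\n"
    "the 0.0 0.1 0.0\n"
    "moscow 0.5 0.4 0.3\n"
)
keep = {"pink", "moscow"}  # assumed to come from your own word selection
kept_lines = list(filter_embeddings(sample, keep))
print(kept_lines)
```

With the real data, the `io.StringIO` object would be replaced by the opened glove.6B.100d.txt file, and the kept lines written to a smaller file that the loading loop can read in exactly the same way.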

Answered by Maxim, Sep 30 '22 17:09


Also, for what it's worth, PyTorch has a good and faster implementation of GloVe these days.

Answered by mithunpaul, Sep 30 '22 17:09