BACKGROUND
I have vectors of sample data, and each vector has a category name (Places, Colors, Names):
['john', 'jay', 'dan', 'nathan', 'bob'] -> 'Names'
['yellow', 'red', 'green'] -> 'Colors'
['tokyo', 'beijing', 'washington', 'mumbai'] -> 'Places'
My objective is to train a model that takes a new input string and predicts which category it belongs to. For example, if a new input is "purple" then I should be able to predict 'Colors' as the correct category. If the new input is "Calgary" it should predict 'Places' as the correct category.
APPROACH
I did some research and came across Word2vec. This library has "similarity" and "most_similar" functions which I can use. So one brute-force approach I thought of is the following:
For instance, for the input "pink" I can calculate its similarity to each word in the 'Names' vector, take the average, and then do the same for the other two vectors. The vector that gives me the highest average similarity would be the correct category for the input.
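A minimal sketch of that average-similarity idea, assuming gensim and the pre-trained GoogleNews vectors (the file name and vocabulary coverage are assumptions, not something I've verified for every word):
from gensim.models import KeyedVectors

# Assumed pre-trained model; any word2vec-format file would do
model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

data = {
    'Names': ['john', 'jay', 'dan', 'nathan', 'bob'],
    'Colors': ['yellow', 'red', 'green'],
    'Places': ['tokyo', 'beijing', 'washington', 'mumbai'],
}

def predict(word):
    # Average similarity of `word` to each category's members;
    # the category with the highest average wins
    averages = {cat: sum(model.similarity(word, w) for w in words) / len(words)
                for cat, words in data.items()}
    return max(averages, key=averages.get)

print(predict('pink'))  # should print 'Colors'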
ISSUE
Given my limited knowledge of NLP and machine learning, I am not sure if that is the best approach, so I am looking for help and suggestions on better ways to solve my problem. I am open to all suggestions; please also point out any mistakes I may have made, as I am new to the machine learning and NLP world.
If you're looking for the simplest / fastest solution, then I'd suggest you take pre-trained word embeddings (Word2Vec or GloVe) and just build a simple query system on top of them. The vectors have been trained on a huge corpus and are likely to contain a good enough approximation of your domain data.
Here's my solution:
import numpy as np

# Category -> words
data = {
    'Names': ['john', 'jay', 'dan', 'nathan', 'bob'],
    'Colors': ['yellow', 'red', 'green'],
    'Places': ['tokyo', 'beijing', 'washington', 'mumbai'],
}

# Word -> category
categories = {word: key for key, words in data.items() for word in words}

# Load the whole embedding matrix
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        embed = np.array(values[1:], dtype=np.float32)
        embeddings_index[word] = embed
print('Loaded %s word vectors.' % len(embeddings_index))

# Embeddings for the words we have categories for
data_embeddings = {key: value for key, value in embeddings_index.items()
                   if key in categories}

# Processing the query
def process(query):
    query_embed = embeddings_index[query]
    scores = {}
    for word, embed in data_embeddings.items():
        category = categories[word]
        # Dot-product similarity, averaged over the category size
        dist = query_embed.dot(embed)
        dist /= len(data[category])
        scores[category] = scores.get(category, 0) + dist
    return scores

# Testing
print(process('pink'))
print(process('frank'))
print(process('moscow'))
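One design note, not part of the original answer: query_embed.dot(embed) is a raw dot product, so words with longer embedding vectors get inflated scores. If you'd rather compare directions only, a cosine variant is a small change to the same function:
def process_cosine(query):
    # Same aggregation as process(), but with L2-normalized vectors
    # so that only the angle between embeddings matters
    q = embeddings_index[query]
    q = q / np.linalg.norm(q)
    scores = {}
    for word, embed in data_embeddings.items():
        category = categories[word]
        sim = q.dot(embed / np.linalg.norm(embed)) / len(data[category])
        scores[category] = scores.get(category, 0) + sim
    return scores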
In order to run it, you'll have to download and unpack the pre-trained GloVe data from here (careful, it's about 800MB!). Upon running, it should produce something like this:
{'Colors': 24.655489603678387, 'Names': 5.058711671829224, 'Places': 0.90213905274868011}
{'Colors': 6.8597321510314941, 'Names': 15.570847320556641, 'Places': 3.5302454829216003}
{'Colors': 8.2919375101725254, 'Names': 4.58830726146698, 'Places': 14.7840416431427}
... which looks pretty reasonable. And that's it! If you don't need such a big model, you can filter the words in GloVe according to their TF-IDF score, as sketched below. Remember that the model size only depends on the data you have and the words you might want to be able to query.
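A hypothetical sketch of that filtering step; the toy corpus, the 0.3 threshold, and the output file name are all illustrative assumptions:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for your real documents
corpus = ['john likes yellow', 'dan flew to tokyo', 'bob painted the wall red']

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# Keep terms whose best TF-IDF score across the corpus clears a threshold
max_scores = np.asarray(tfidf.max(axis=0).todense()).ravel()
keep = {w for w, s in zip(vectorizer.get_feature_names_out(), max_scores)
        if s > 0.3}

# Write a smaller embedding file containing only the kept words
with open('glove.6B.100d.txt', encoding='utf-8') as src, \
        open('glove.filtered.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        if line.split(' ', 1)[0] in keep:
            dst.write(line)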
Also, for what it's worth, PyTorch has a good and faster implementation of GloVe these days.
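For example, via torchtext (the name/dim parameters below are assumptions matching the 6B/100d file used above; the loader downloads and caches the vectors on first use):
from torchtext.vocab import GloVe

glove = GloVe(name='6B', dim=100)  # same 6B / 100d vectors as above
pink = glove['pink']               # torch.Tensor of shape (100,)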