 

Retain ordering of text data when vectorizing

I am attempting to write a machine learning algorithm with scikit-learn that parses text and classifies it based on training data.

The example for working with text data, taken directly from the scikit-learn documentation, uses a CountVectorizer to generate a sparse matrix of how many times each word appears in each document.

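>>> from sklearn.datasets import fetch_20newsgroups
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> twenty_train = fetch_20newsgroups(subset='train')
>>> count_vect = CountVectorizer()
>>> X_train_counts = count_vect.fit_transform(twenty_train.data)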

Unfortunately, this does not take the ordering of the words into account. It is possible to use larger n-grams (CountVectorizer(ngram_range=(min, max))) to look at specific phrases, but this rapidly increases the number of features and still doesn't work particularly well.
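
To make the trade-off concrete, here is a minimal pure-Python sketch (not scikit-learn's implementation) that collects the distinct word n-grams a vocabulary would contain for a given range. The helper name ngram_vocabulary and the two toy sentences are made up for illustration, and tokenization is assumed to be a plain whitespace split.

```python
def ngram_vocabulary(docs, min_n, max_n):
    """Return the set of distinct word n-grams with min_n <= n <= max_n."""
    vocab = set()
    for doc in docs:
        tokens = doc.lower().split()
        for n in range(min_n, max_n + 1):
            # Slide a window of n tokens across the document.
            for i in range(len(tokens) - n + 1):
                vocab.add(" ".join(tokens[i:i + n]))
    return vocab


# Two toy sentences with identical word counts but different meaning.
docs = [
    "the dog bit the man",
    "the man bit the dog",
]

unigrams = ngram_vocabulary(docs, 1, 1)
up_to_bigrams = ngram_vocabulary(docs, 1, 2)
up_to_trigrams = ngram_vocabulary(docs, 1, 3)

print(len(unigrams), len(up_to_bigrams), len(up_to_trigrams))  # 4 9 15
```

With unigrams alone the two sentences are indistinguishable, because they contain the same words the same number of times. Bigrams such as "dog bit" and "man bit" separate them, but the vocabulary already grows from 4 to 9 to 15 entries on this tiny example.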

Is there another good way of dealing with ordered text? I'm definitely open to using a natural language parser (nltk, textblob, etc.) along with scikit-learn.

2Cubed asked Oct 30 '22 23:10


1 Answer

What about word2vec embeddings? word2vec is a neural-network-based method that embeds words as vectors, learned from the contexts in which each word appears. This could provide a more sophisticated set of features for your classifier.
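
As a rough illustration of what "context" means here, the skip-gram variant of word2vec learns from pairs of a word and its nearby neighbours. The sketch below only shows how such pairs can be formed from a token list; it is not gensim's code, the function name skipgram_pairs is hypothetical, and real implementations add refinements such as subsampling and negative sampling.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for tokens within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["user", "response", "time"], window=1)
for center, context in pairs:
    print(center, "->", context)
```

Words that keep showing up with similar neighbours end up with similar vectors, which is the information a plain word count cannot provide.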

One powerful Python library for natural language processing with a good word2vec implementation is gensim. Gensim is built to be scalable and fast, and has advanced text processing capabilities. Here is a quick outline of how to get started:

Installing

Just run pip install --upgrade gensim.

A simple word2vec example

import gensim

# Each document is a list of tokens.
documents = [['human', 'interface', 'computer'],
             ['survey', 'user', 'computer', 'system', 'response', 'time'],
             ['eps', 'user', 'interface', 'system'],
             ['system', 'human', 'system', 'eps'],
             ['user', 'response', 'time'],
             ['trees'],
             ['graph', 'trees'],
             ['graph', 'minors', 'trees'],
             ['graph', 'minors', 'survey']]

# min_count=1 keeps every word, even those that appear only once.
model = gensim.models.Word2Vec(documents, min_count=1)
# Word vectors are looked up through the model's wv attribute.
print(model.wv["survey"])

This prints the vector that "survey" maps to, which you could use as a feature input to your classifier.
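
Since each vector represents a single word, a classifier that works on whole documents needs one fixed-length vector per document. One simple option, sketched below under stated assumptions, is to average the vectors of a document's words. The function name average_vector and the 3-dimensional toy_vectors table are hypothetical; with a trained model the lookups would come from the model instead.

```python
def average_vector(tokens, vectors):
    """Average the vectors of known tokens; return None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Hypothetical 3-dimensional embeddings standing in for a trained model.
toy_vectors = {
    "user": [1.0, 0.0, 2.0],
    "response": [0.0, 1.0, 0.0],
    "time": [2.0, 2.0, 1.0],
}

# "unknown" has no vector and is skipped.
features = average_vector(["user", "response", "time", "unknown"], toy_vectors)
print(features)  # [1.0, 1.0, 1.0]
```

Note that averaging ignores the order of words within the document. Any context sensitivity comes from how the individual word vectors were learned, not from the averaging step.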

Gensim has a lot of other capabilities, and it is worth getting to know it better if you're interested in Natural Language Processing.

bpachev answered Nov 15 '22 06:11