I am attempting to write a machine learning algorithm with scikit-learn
that parses text and classifies it based on training data.
The example for working with text data, taken directly from the scikit-learn
documentation, uses a CountVectorizer
to generate a sparse matrix of per-document word counts.
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> count_vect = CountVectorizer()
>>> X_train_counts = count_vect.fit_transform(twenty_train.data)
Unfortunately, this does not take any word ordering into account. It is possible to use larger n-grams (CountVectorizer(ngram_range=(min, max))) to capture specific phrases, but this increases the number of features rapidly and often doesn't improve accuracy much.
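For example, on a made-up two-sentence corpus, just adding bigrams more than doubles the vocabulary:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> docs = ["the cat sat on the mat", "the dog sat on the log"]
>>> len(CountVectorizer(ngram_range=(1, 1)).fit(docs).vocabulary_)
7
>>> len(CountVectorizer(ngram_range=(1, 2)).fit(docs).vocabulary_)
15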
Is there another good way of dealing with word order? I'm definitely open to using a natural language parser (nltk, textblob, etc.) along with scikit-learn.
What about word2vec embeddings? word2vec is a neural-network-based technique for embedding words into vectors, and it takes each word's context into account. This could provide a more sophisticated set of features for your classifier.
One powerful Python library for natural language processing with a good word2vec implementation is gensim. Gensim is built to be scalable and fast, and it has advanced text-processing capabilities. Here is a quick outline of how to get started:
Installing
Just do pip install --upgrade gensim (easy_install -U gensim also works, but pip is preferred).
A simple word2vec example
import gensim

# Each document is a pre-tokenized list of words
documents = [['human', 'interface', 'computer'],
             ['survey', 'user', 'computer', 'system', 'response', 'time'],
             ['eps', 'user', 'interface', 'system'],
             ['system', 'human', 'system', 'eps'],
             ['user', 'response', 'time'],
             ['trees'],
             ['graph', 'trees'],
             ['graph', 'minors', 'trees'],
             ['graph', 'minors', 'survey']]

# Train a word2vec model; min_count=1 keeps even words that appear
# only once, which matters for a toy corpus this small
model = gensim.models.Word2Vec(documents, min_count=1)

# Look up the learned vector for a word
print(model.wv["survey"])
This will output the vector that "survey" maps to, which you could use as a feature input to your classifier.
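For instance, here is a minimal sketch of one common approach: average the word vectors of each document to get a fixed-length feature vector, then feed that to any scikit-learn classifier. The doc_vector helper and the labels below are made up for illustration:
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical helper: average the word2vec vectors of a document's
# words to get one fixed-length feature vector per document
def doc_vector(model, tokens):
    vectors = [model.wv[w] for w in tokens if w in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

# `documents` and `model` are from the example above; the labels are
# made up (0 = interface-related, 1 = graph-related)
X = np.array([doc_vector(model, doc) for doc in documents])
y = [0, 0, 0, 0, 0, 1, 1, 1, 1]
clf = LogisticRegression().fit(X, y)
Note that averaging throws word order away again; what you gain is that each word's vector already encodes its typical context. If order itself matters, the per-word vectors can instead be fed as a sequence to an order-aware model.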
Gensim has a lot of other capabilities, and it is worth getting to know it better if you're interested in Natural Language Processing.