How to transform text into a vector

I'm learning classification and have read about using vectors, but I can't find an algorithm that translates a text made of words into a vector. Is it a matter of hashing each word and adding 1 at the hash location in the vector?

595

asked Jun 11 '13 20:06

broersa

1 Answer

When most people talk about turning text into a feature vector, all they mean is recording which words (tokens) are present, usually along with how many times each one occurs.

There are two main ways to encode such a vector. One is explicit, where you store a 0 for each word that is not present (but is in your vocabulary). The other is implicit, like a sparse matrix (but just a single vector), where you only encode terms with a frequency value >= 1.

Bag of words model

The article that explains this best is most likely the one on the bag-of-words model, which is used extensively in natural language processing applications.

Explicit BoW vector example:

Suppose you have the vocabulary:

{brown, dog, fox, jumped, lazy, over, quick, the, zebra}

The sentence "the quick brown fox jumped over the lazy dog" could be encoded as:

<1, 1, 1, 1, 1, 1, 1, 2, 0>

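Remember, position is important: each slot in the vector corresponds to one fixed vocabulary term, here in alphabetical order.

The sentence "the zebra jumped", even though it is shorter, would then be encoded as a vector of the same length: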

<0, 0, 0, 1, 0, 0, 0, 1, 1>
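The explicit encoding above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the function name `explicit_bow` is made up for this example, and tokenization is assumed to be a naive lowercase-and-split-on-whitespace step.

```python
# Fixed, sorted vocabulary; each term owns exactly one position in the vector.
VOCABULARY = ["brown", "dog", "fox", "jumped", "lazy", "over", "quick", "the", "zebra"]


def explicit_bow(text, vocabulary=VOCABULARY):
    """Return a dense count vector with one entry per vocabulary term.

    Hypothetical helper for illustration only.
    """
    # Assumption: naive tokenization (lowercase, split on whitespace).
    tokens = text.lower().split()
    # Map each vocabulary term to its position in the vector.
    index = {term: i for i, term in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for token in tokens:
        if token in index:  # words outside the vocabulary are ignored
            vector[index[token]] += 1
    return vector


print(explicit_bow("the quick brown fox jumped over the lazy dog"))
# [1, 1, 1, 1, 1, 1, 1, 2, 0]
print(explicit_bow("the zebra jumped"))
# [0, 0, 0, 1, 0, 0, 0, 1, 1]
```

Both vectors have the same length as the vocabulary, regardless of how long the sentence is.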

The problem with the explicit approach is that if you have hundreds of thousands of vocabulary terms, each document vector will also have hundreds of thousands of entries (most of them zero).

Implicit BoW vector example:

In this case, the sentence "the zebra jumped" could be encoded as:

<'jumped': 1, 'the': 1, 'zebra': 1>

where the order is arbitrary.
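As a rough sketch of the implicit form, a plain dictionary of counts is enough. The function name `implicit_bow` is hypothetical, and the same naive whitespace tokenization is assumed; `collections.Counter` is part of the Python standard library.

```python
from collections import Counter


def implicit_bow(text):
    """Return a sparse bag of words: only terms that occur are stored.

    Hypothetical helper for illustration only.
    """
    # Assumption: naive tokenization (lowercase, split on whitespace).
    tokens = text.lower().split()
    # Counter only holds keys for tokens that were actually seen.
    return dict(Counter(tokens))


bow = implicit_bow("the zebra jumped")
# Sorted here only to make the printed order predictable.
print(sorted(bow.items()))
# [('jumped', 1), ('the', 1), ('zebra', 1)]
```

The storage cost now depends on the number of distinct words in the document rather than on the size of the vocabulary.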

123

answered Nov 07 '22 05:11

Wesley Baugh

Tags:

machine-learning

classification