I'm learning classification. I read about using vectors. But I can't find an algorithm to translate a text with words to a vector. Is it about generating a hash of the words and adding a 1 to the hash location in the vector?
When most people talk about turning text into a feature vector, all they mean is recording which words (tokens) are present, usually along with how often each one occurs.
There are two main ways to encode the vector. One is explicit, where you store a 0 for each word that is not present (but is in your vocabulary). The other is implicit, like a sparse matrix (but just a single vector), where you only encode terms with a frequency value >= 1.
The best explanation of this is probably the article on the bag of words model, which is used extensively in natural language processing applications.
Suppose you have the vocabulary:
{brown, dog, fox, jumped, lazy, over, quick, the, zebra}
The sentence "the quick brown fox jumped over the lazy dog" could be encoded as:
<1, 1, 1, 1, 1, 1, 1, 2, 0>
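Remember, position is important: each slot in the vector corresponds to one vocabulary term, in the same fixed order for every document (here, "the" is the eighth term and occurs twice).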
The sentence "the zebra jumped", even though it is shorter in length, would then be encoded as:
<0, 0, 0, 1, 0, 0, 0, 1, 1>
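Here is a minimal sketch of the explicit encoding in Python. The function name `encode_explicit`, the lowercase-and-split-on-whitespace tokenization, and the choice to ignore words outside the vocabulary are my own assumptions for illustration; a real pipeline would likely use a proper tokenizer.

```python
# Vocabulary in a fixed order; slot i of the vector counts VOCABULARY[i].
VOCABULARY = ["brown", "dog", "fox", "jumped", "lazy", "over", "quick", "the", "zebra"]


def encode_explicit(sentence, vocabulary=VOCABULARY):
    """Return a dense count vector with one entry per vocabulary term."""
    # Map each term to its position so lookups do not scan the list.
    index = {term: i for i, term in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    # Naive tokenization (assumption): lowercase, split on whitespace.
    for token in sentence.lower().split():
        if token in index:  # out-of-vocabulary tokens are skipped (assumption)
            vector[index[token]] += 1
    return vector


print(encode_explicit("the quick brown fox jumped over the lazy dog"))
# [1, 1, 1, 1, 1, 1, 1, 2, 0]
print(encode_explicit("the zebra jumped"))
# [0, 0, 0, 1, 0, 0, 0, 1, 1]
```

Both vectors have the same length (the vocabulary size) no matter how long the sentence is, which is what lets a classifier compare them.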
The problem with the explicit approach is that if you have hundreds of thousands of vocabulary terms, each document vector will also have hundreds of thousands of entries (mostly zeros). The implicit approach avoids this by storing only the terms that actually occur.
In this case, the sentence "the zebra jumped" could be encoded as:
<'jumped': 1, 'the': 1, 'zebra': 1>
where the order is arbitrary.
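The implicit encoding can be sketched with a plain dictionary of counts. Again, `encode_sparse` and the whitespace tokenization are hypothetical names and choices of mine, not a library API; the optional `vocabulary` argument is an assumption for dropping unknown words.

```python
from collections import Counter


def encode_sparse(sentence, vocabulary=None):
    """Return a {term: count} mapping holding only terms that occur."""
    # Naive tokenization (assumption): lowercase, split on whitespace.
    counts = Counter(sentence.lower().split())
    if vocabulary is not None:
        # Optionally drop tokens that are not in the known vocabulary.
        allowed = set(vocabulary)
        counts = {term: n for term, n in counts.items() if term in allowed}
    return dict(counts)


print(encode_sparse("the zebra jumped"))
# {'the': 1, 'zebra': 1, 'jumped': 1}
```

Terms with a count of zero are simply absent, so the storage cost depends on the number of distinct words in the document, not on the size of the vocabulary.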