We have models for converting words to vectors (for example the word2vec model). Do similar models exist which convert sentences/documents into vectors, using perhaps the vectors learnt for the individual words?
Converting words to vectors, or word vectorization, is a natural language processing (NLP) process. The process uses language models to map words into vector space. A vector space represents each word by a vector of real numbers. It also allows words with similar meanings have similar representations.
In summary, converting words into vectors, which deep learning algorithms can ingest and process, helps to formulate a much better understanding of natural language.
1) Skip gram method: paper here and the tool that uses it, google word2vec
2) Using LSTM-RNN to form semantic representations of sentences.
3) Representations of sentences and documents. The Paragraph vector is introduced in this paper. It is basically an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents.
4) Though this paper does not form sentence/paragraph vectors, it is simple enough to do that. One can just plug in the individual word vectors(Glove word vectors are found to give the best performance) and then can form a vector representation of the whole sentence/paragraph.
5) Using a CNN to summarize documents.
It all depends on:
If you've generated the model using Word2Vec, you can either try:
Or you can do what some people do, i.e. sum all content words in the documents and divide by the content words, e.g. https://github.com/alvations/oque/blob/master/o.py#L13 (note: line 17-18 is a hack to reduce noise):
def sent_vectorizer(sent, model): sent_vec = np.zeros(400) numw = 0 for w in sent: try: sent_vec = np.add(sent_vec, model[w]) numw+=1 except: pass return sent_vec / np.sqrt(sent_vec.dot(sent_vec))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With