
How can a sentence or a document be converted to a vector?

We have models for converting words to vectors (for example the word2vec model). Do similar models exist which convert sentences/documents into vectors, using perhaps the vectors learnt for the individual words?

Sahil asked Jun 12 '15 05:06




2 Answers

1) Skip-gram method: paper here, and the tool that implements it, Google's word2vec.

2) Using LSTM-RNN to form semantic representations of sentences.

3) Representations of sentences and documents. The Paragraph Vector is introduced in this paper. It is an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text, such as sentences, paragraphs, and documents.

4) Though this paper does not itself form sentence/paragraph vectors, it is simple enough to do so with it. One can plug in the individual word vectors (GloVe word vectors are found to give the best performance) and combine them into a vector representation of the whole sentence/paragraph.

5) Using a CNN to summarize documents.
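
Option 4 above (combining individual word vectors into one sentence vector) can be sketched as follows. This is a minimal illustration, not the method of any of the linked papers: the 3-dimensional `toy_vectors` dictionary is made up and stands in for real word2vec or GloVe vectors, and `average_vector` and `cosine` are hypothetical helper names.

```python
import math

# Toy 3-dimensional word vectors with made-up values (an assumption for this
# sketch); in practice they would come from a trained word2vec or GloVe model.
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.3, 0.1],
    "sat": [0.1, 0.9, 0.1],
    "ran": [0.2, 0.7, 0.2],
    "stock": [0.0, 0.1, 0.9],
    "fell": [0.1, 0.3, 0.8],
}

def average_vector(tokens, vectors, dim=3):
    """Average the vectors of the tokens that have one; zeros if none do."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension together.
    return [sum(column) / len(known) for column in zip(*known)]

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

s1 = average_vector(["the", "cat", "sat"], toy_vectors)   # "the" is skipped
s2 = average_vector(["the", "dog", "ran"], toy_vectors)
s3 = average_vector(["stock", "fell"], toy_vectors)

# With these toy values, s1 is closer to s2 than to s3.
print(cosine(s1, s2), cosine(s1, s3))
```

Averaging ignores word order, so it is usually treated as a baseline rather than a replacement for the sequence models in options 2 and 5.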

Azrael answered Oct 08 '22 17:10


It all depends on:

  • which vector model you're using
  • the purpose of the model
  • your creativity in combining word vectors into a document vector

If you've generated the model using Word2Vec, you can try either:

  • Doc2Vec: https://radimrehurek.com/gensim/models/doc2vec.html
  • Wiki2Vec: https://github.com/idio/wiki2vec

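Or you can do what some people do, i.e. sum the vectors of all content words in the document and divide by the number of content words, e.g. https://github.com/alvations/oque/blob/master/o.py#L13 (note: lines 17-18 are a hack to reduce noise). The function below scales the summed vector to unit length instead of dividing by the word count, which gives a vector pointing in the same direction: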

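    import numpy as np

    def sent_vectorizer(sent, model):
        """Sum the vectors of the known words in sent and scale to unit length."""
        sent_vec = np.zeros(400)  # 400 = dimensionality of the word vectors in model
        numw = 0
        for w in sent:
            try:
                sent_vec = np.add(sent_vec, model[w])
                numw += 1
            except KeyError:  # assumes model[w] raises KeyError for unknown words
                pass
        if numw == 0:
            return sent_vec  # no known words: avoid dividing by zero
        return sent_vec / np.sqrt(sent_vec.dot(sent_vec))
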
alvas answered Oct 08 '22 17:10