
Paragraph embedding with ELMo

I'm trying to understand how to prepare paragraphs for ELMo vectorization.

The docs only show how to embed multiple sentences/words at a time, e.g.:

sentences = [["the", "cat", "is", "on", "the", "mat"],
             ["dogs", "are", "in", "the", "fog", ""]]
elmo(
    inputs={
        "tokens": sentences,
        "sequence_len": [6, 5]
    },
    signature="tokens",
    as_dict=True
)["elmo"]

As I understand, this will return 2 vectors each representing a given sentence. How would I go about preparing the input data to vectorize a whole paragraph containing multiple sentences? Note that I would like to use my own preprocessing.

Can this be done like so?

sentences = [["<s>", "the", "cat", "is", "on", "the", "mat", ".", "</s>",
              "<s>", "dogs", "are", "in", "the", "fog", ".", "</s>"]]

or maybe like so?

sentences = [["the", "cat", "is", "on", "the", "mat", ".", 
              "dogs", "are", "in", "the", "fog", "."]]

tensa11 asked Nov 17 '25 06:11


1 Answer

ELMo produces contextual word vectors, so the vector for a word is a function of both the word itself and the context (e.g., the sentence) it appears in.

Like your example from the docs, you want your paragraph to be a list of sentences, each of which is a list of tokens. So your second example. To get this format, you could use the spaCy tokenizer:

import spacy

# Install the language model first, e.g. `python -m spacy download en_core_web_sm`.
nlp = spacy.load('en_core_web_sm')

text = "The cat is on the mat. Dogs are in the fog."
toks = nlp(text)
sentences = [[w.text for w in s] for s in toks.sents]

The extra padding "" on the second sentence is only there to make the batch rectangular; sequence_len tells ELMo how many tokens in each row are real, so the padding is ignored.
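
For what it's worth, the tokens signature is documented as taking a rectangular [batch_size, max_length] string tensor, so sentences of different lengths may still have to be padded before they are passed in. Here is a minimal sketch in plain Python. pad_batch is a hypothetical helper (not part of the ELMo module) that pads with "" and returns the true lengths for sequence_len:

```python
def pad_batch(sentences, pad_token=""):
    """Pad tokenized sentences to equal length.

    Hypothetical helper, not part of the ELMo module. Returns the padded
    batch plus the original lengths, which is what "sequence_len" expects.
    """
    lengths = [len(s) for s in sentences]
    max_len = max(lengths, default=0)
    padded = [list(s) + [pad_token] * (max_len - len(s)) for s in sentences]
    return padded, lengths


sentences = [["the", "cat", "is", "on", "the", "mat"],
             ["dogs", "are", "in", "the", "fog"]]
tokens, sequence_len = pad_batch(sentences)
# tokens and sequence_len can then be passed as the "tokens" and
# "sequence_len" inputs of the elmo(...) call from the question.
```

This also works on the output of the spaCy snippet above, since that is already a list of token lists.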

Update:

"As I understand, this will return 2 vectors each representing a given sentence"

No, this will return a vector for each word in each sentence. If you want the whole paragraph to be the context for each word, pass it as a single sequence:

sentences = [["the", "cat", "is", "on", "the", "mat", "dogs", "are", "in", "the", "fog"]]

and

...
"sequence_len": [11]
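
To build that single-sequence input from already tokenized sentences, and then collapse the per-word vectors into one paragraph vector, something like the following could work. Both flatten_paragraph and mean_pool are hypothetical helpers written in plain Python for illustration. Averaging is just one simple pooling choice, not something ELMo prescribes, and the toy 2-dimensional vectors stand in for the real ELMo output:

```python
def flatten_paragraph(sentences):
    """Concatenate tokenized sentences into a single-sequence batch."""
    tokens = [tok for sent in sentences for tok in sent]
    return [tokens], [len(tokens)]


def mean_pool(word_vectors, length):
    """Average the first `length` word vectors, ignoring any padding after."""
    if length == 0:
        raise ValueError("cannot pool an empty sequence")
    used = word_vectors[:length]
    dim = len(used[0])
    return [sum(vec[i] for vec in used) / length for i in range(dim)]


paragraph = [["the", "cat", "is", "on", "the", "mat"],
             ["dogs", "are", "in", "the", "fog"]]
tokens, sequence_len = flatten_paragraph(paragraph)

# Toy stand-in for one row of the ELMo output (real vectors are much wider).
toy_vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # last row is padding
paragraph_vector = mean_pool(toy_vectors, length=2)
```

Here tokens and sequence_len match the single-sequence input shown above, and paragraph_vector is the average of the non-padding rows.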

al0 answered Nov 19 '25 21:11