 

How to train a sense2vec model

The sense2vec documentation mentions three primary files, the first of them being merge_text.py. I have tried several types of input (txt, csv, and a bzipped file, since merge_text.py tries to open files compressed by bzip2).

The file can be found at: https://github.com/spacy-io/sense2vec/blob/master/bin/merge_text.py

What input format does this script require? And could anyone please suggest how to train the model?

asked Jun 21 '16 by Sushant

1 Answer

I extended and adjusted the code samples from sense2vec.

You go from this input text:

"As far as Saudi Arabia and its motives, that is very simple also. The Saudis are good at money and arithmetic. Faced with the painful choice of losing money maintaining current production at US$60 per barrel or taking two million barrels per day off the market and losing much more money - it's an easy choice: take the path that is less painful. If there are secondary reasons like hurting US tight oil producers or hurting Iran and Russia, that's great, but it's really just about the money."

To this:

as|ADV far|ADV as|ADP saudi_arabia|ENT and|CCONJ its|ADJ motif|NOUN that|ADJ is|VERB very|ADV simple|ADJ also|ADV saudis|ENT are|VERB good|ADJ at|ADP money|NOUN and|CCONJ arithmetic|NOUN faced|VERB with|ADP painful_choice|NOUN of|ADP losing|VERB money|NOUN maintaining|VERB current_production|NOUN at|ADP us$|SYM 60|MONEY per|ADP barrel|NOUN or|CCONJ taking|VERB two_million|CARDINAL barrel|NOUN per|ADP day|NOUN off|ADP market|NOUN and|CCONJ losing|VERB much_more_money|NOUN it|PRON 's|VERB easy_choice|NOUN take|VERB path|NOUN that|ADJ is|VERB less|ADV painful|ADJ if|ADP there|ADV are|VERB secondary_reason|NOUN like|ADP hurting|VERB us|ENT tight_oil_producer|NOUN or|CCONJ hurting|VERB iran|ENT and|CCONJ russia|ENT 's|VERB great|ADJ but|CCONJ it|PRON 's|VERB really|ADV just|ADV about|ADP money|NOUN

  • Double line breaks are interpreted as document boundaries (see the sketch after this list).
  • URLs are recognized as such, stripped down to domain.tld, and tagged |URL.
  • Nouns (including nouns that are part of noun phrases) are lemmatized (e.g. motives becomes motifs).
  • Words with POS tags such as DET (determiner) and PUNCT (punctuation) are dropped.
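To make the document-boundary rule concrete, here is a quick check against the strip_meta() helper from the full listing below, on a toy input of my own (not from the original corpus): single line breaks become spaces, double line breaks survive as one newline.

raw = "First doc, line one.\nstill the first doc.\n\nSecond doc."
print(strip_meta(raw))
# First doc, line one. still the first doc.
# Second doc.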

Here's the code. Let me know if you have questions.

I'll probably publish it on github.com/woltob soon.

import spacy
import re

nlp = spacy.load('en')
nlp.matcher = None  # disable the rule-based matcher; only the statistical pipeline is needed

LABELS = {
    'ENT': 'ENT',
    'PERSON': 'PERSON',
    'NORP': 'ENT',
    'FAC': 'ENT',
    'ORG': 'ENT',
    'GPE': 'ENT',
    'LOC': 'ENT',
    'LAW': 'ENT',
    'PRODUCT': 'ENT',
    'EVENT': 'ENT',
    'WORK_OF_ART': 'ENT',
    'LANGUAGE': 'ENT',
    'DATE': 'DATE',
    'TIME': 'TIME',
    'PERCENT': 'PERCENT',
    'MONEY': 'MONEY',
    'QUANTITY': 'QUANTITY',
    'ORDINAL': 'ORDINAL',
    'CARDINAL': 'CARDINAL'
}

pre_format_re = re.compile(r'^[\`\*\~]')
post_format_re = re.compile(r'[\`\*\~]$')
url_re = re.compile(r'(https?:\/\/)?([a-z0-9-]+\.)?([\d\w]+?\.[^\/]{2,63})')
single_linebreak_re = re.compile('\n')
double_linebreak_re = re.compile('\n{2,}')
whitespace_re = re.compile(r'[ \t]+')
quote_re = re.compile(r'"|`|´')

def strip_meta(text):
    text = text.replace('per cent', 'percent')
    text = text.replace('&gt;', '>').replace('&lt;', '<')
    text = pre_format_re.sub('', text)
    text = post_format_re.sub('', text)
    text = double_linebreak_re.sub('{2break}', text)
    text = single_linebreak_re.sub(' ', text)
    text = text.replace('{2break}', '\n')
    text = whitespace_re.sub(' ', text)
    text = quote_re.sub('', text)
    return text

def transform_doc(doc):
    # Merge named entities into single tokens (spaCy 1.x span.merge API).
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, LABELS[ent.label_])
    for np in doc.noun_chunks:
        # Strip leading tokens that are not modifiers or compounds (e.g. determiners).
        while len(np) > 1 and np[0].dep_ not in ('advmod', 'amod', 'compound'):
            np = np[1:]
        np.merge(np.root.tag_, np.text, np.root.ent_type_)
    strings = []
    for sent in doc.sents:
        sentence = []
        if sent.text.strip():
            for w in sent:
                if w.is_space:
                    continue
                w_ = represent_word(w)
                if w_:
                    sentence.append(w_)
            strings.append(' '.join(sentence))
    if strings:
        return '\n'.join(strings) + '\n'
    else:
        return ''


def represent_word(word):
    if word.like_url:
        x = url_re.search(word.text.strip().lower())
        if x:
            return x.group(3)+'|URL'
        else:
            return word.text.lower().strip()+'|URL?'
    text = re.sub(r'\s', '_', word.text.strip().lower())
    tag = LABELS.get(word.ent_type_)
    # Dropping PUNCTUATION such as commas and DET like the
    if tag is None and word.pos_ not in ['PUNCT', 'DET']:
        tag = word.pos_
    elif tag is None:
        return None
    # if not word.pos_:
    #    tag = '?'
    return text + '|' + tag

corpus = '''
As far as Saudi Arabia and its motives, that is very simple also. The Saudis are
good at money and arithmetic. Faced with the painful choice of losing money
maintaining current production at US$60 per barrel or taking two million barrels
per day off the market and losing much more money - it's an easy choice: take
the path that is less painful. If there are secondary reasons like hurting US
tight oil producers or hurting Iran and Russia, that's great, but it's really
just about the money.
'''

corpus_stripped = strip_meta(corpus)

doc = nlp(corpus_stripped)
corpus_ = []
for word in doc:
    # Only lemmatize NOUN and PROPN tokens longer than three characters
    # whose lemma differs in length from the surface form.
    if word.pos_ in ['NOUN', 'PROPN'] and len(word.text) > 3 and len(word.text) != len(word.lemma_):
        # Keep the first character of the original token (preserving its case),
        # append the rest of the lemma, then re-add any trailing whitespace.
        lemma_ = str(word.text[:1] + word.lemma_[1:] + word.text_with_ws[len(word.text):])
        corpus_.append(lemma_)
    # All other words are added unchanged.
    else:
        corpus_.append(word.text_with_ws)

result = transform_doc(nlp(''.join(corpus_)))

sense2vec_filename = 'text.txt'
with open(sense2vec_filename, 'w') as file:
    file.write(result)
print(result)
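The script above stops at writing text.txt, so here is how you could actually train vectors on it. A minimal sketch using Gensim's word2vec (my own suggestion, not part of the original sense2vec pipeline; the hyperparameters are illustrative, not the ones used for the published model):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Each line of text.txt is one sentence of 'token|TAG' units,
# so LineSentence can stream the file directly.
sentences = LineSentence(sense2vec_filename)

model = Word2Vec(
    sentences,
    size=128,     # vector dimensionality (illustrative)
    window=5,
    min_count=1,  # tiny toy corpus; raise this for real data
    workers=4,
    sg=1,         # skip-gram; CBOW (sg=0) works too
)
model.save('sense2vec_gensim.model')

# Because the tag is part of the token, you query by word *and* sense:
print(model.most_similar('money|NOUN'))

On a corpus this small the neighbours are noise, of course; the point is only that the token|TAG format round-trips into an ordinary word2vec training run.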

You could visualise your model in TensorBoard via Gensim using this approach: https://github.com/ArdalanM/gensim2tensorboard

I'll also adjust that code to work with the sense2vec approach (e.g. words become lowercase in its preprocessing step; just comment that out in the code).

Happy coding, woltob

answered Oct 28 '22 by woltob