 

How to train a sense2vec model

The sense2vec documentation mentions three primary files, the first of them being merge_text.py. I have tried several types of input (txt, csv, and a bzipped file, since merge_text.py tries to open files compressed by bzip2).

The file can be found at: https://github.com/spacy-io/sense2vec/blob/master/bin/merge_text.py

What input format does this script require? And could anyone please suggest how to train the model?

asked Jun 21 '16 by Sushant

1 Answer

I extended and adjusted the code samples from sense2vec.

You go from this input text:

"As far as Saudi Arabia and its motives, that is very simple also. The Saudis are good at money and arithmetic. Faced with the painful choice of losing money maintaining current production at US$60 per barrel or taking two million barrels per day off the market and losing much more money - it's an easy choice: take the path that is less painful. If there are secondary reasons like hurting US tight oil producers or hurting Iran and Russia, that's great, but it's really just about the money."

To this:

as|ADV far|ADV as|ADP saudi_arabia|ENT and|CCONJ its|ADJ motif|NOUN that|ADJ is|VERB very|ADV simple|ADJ also|ADV saudis|ENT are|VERB good|ADJ at|ADP money|NOUN and|CCONJ arithmetic|NOUN faced|VERB with|ADP painful_choice|NOUN of|ADP losing|VERB money|NOUN maintaining|VERB current_production|NOUN at|ADP us$|SYM 60|MONEY per|ADP barrel|NOUN or|CCONJ taking|VERB two_million|CARDINAL barrel|NOUN per|ADP day|NOUN off|ADP market|NOUN and|CCONJ losing|VERB much_more_money|NOUN it|PRON 's|VERB easy_choice|NOUN take|VERB path|NOUN that|ADJ is|VERB less|ADV painful|ADJ if|ADP there|ADV are|VERB secondary_reason|NOUN like|ADP hurting|VERB us|ENT tight_oil_producer|NOUN or|CCONJ hurting|VERB iran|ENT and|CCONJ russia|ENT 's|VERB great|ADJ but|CCONJ it|PRON 's|VERB really|ADV just|ADV about|ADP money|NOUN

  • Double line breaks are interpreted as document boundaries (see the sketch after this list).
  • URLs are recognized as such, stripped down to domain.tld, and tagged |URL.
  • Nouns (including nouns that are part of noun phrases) are lemmatized (e.g. motives becomes motifs).
  • Words with POS tags such as DET (determiner) and PUNCT (punctuation) are dropped.
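To make the document-boundary rule concrete, here is a quick check against the strip_meta() helper from the full listing below, on a toy input of my own (not from the original corpus): single line breaks become spaces, double line breaks survive as one newline.

raw = "First doc, line one.\nstill the first doc.\n\nSecond doc."
print(strip_meta(raw))
# First doc, line one. still the first doc.
# Second doc.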

Here's the code. Let me know if you have questions.

I'll probably publish it on github.com/woltob soon.

import spacy
import re

nlp = spacy.load('en')
nlp.matcher = None  # disable the rule-based matcher; only the statistical pipeline is needed

LABELS = {
    'ENT': 'ENT',
    'PERSON': 'PERSON',
    'NORP': 'ENT',
    'FAC': 'ENT',
    'ORG': 'ENT',
    'GPE': 'ENT',
    'LOC': 'ENT',
    'LAW': 'ENT',
    'PRODUCT': 'ENT',
    'EVENT': 'ENT',
    'WORK_OF_ART': 'ENT',
    'LANGUAGE': 'ENT',
    'DATE': 'DATE',
    'TIME': 'TIME',
    'PERCENT': 'PERCENT',
    'MONEY': 'MONEY',
    'QUANTITY': 'QUANTITY',
    'ORDINAL': 'ORDINAL',
    'CARDINAL': 'CARDINAL'
}

pre_format_re = re.compile(r'^[\`\*\~]')
post_format_re = re.compile(r'[\`\*\~]$')
url_re = re.compile(r'(https?:\/\/)?([a-z0-9-]+\.)?([\d\w]+?\.[^\/]{2,63})')
single_linebreak_re = re.compile('\n')
double_linebreak_re = re.compile('\n{2,}')
whitespace_re = re.compile(r'[ \t]+')
quote_re = re.compile(r'"|`|´')

def strip_meta(text):
    text = text.replace('per cent', 'percent')
    text = text.replace('&gt;', '>').replace('&lt;', '<')
    text = pre_format_re.sub('', text)
    text = post_format_re.sub('', text)
    text = double_linebreak_re.sub('{2break}', text)
    text = single_linebreak_re.sub(' ', text)
    text = text.replace('{2break}', '\n')
    text = whitespace_re.sub(' ', text)
    text = quote_re.sub('', text)
    return text

def transform_doc(doc):
    # Merge named entities into single tokens (spaCy 1.x span.merge API).
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, LABELS[ent.label_])
    for np in doc.noun_chunks:
        # Strip leading tokens that are not modifiers or compounds (e.g. determiners).
        while len(np) > 1 and np[0].dep_ not in ('advmod', 'amod', 'compound'):
            np = np[1:]
        np.merge(np.root.tag_, np.text, np.root.ent_type_)
    strings = []
    for sent in doc.sents:
        sentence = []
        if sent.text.strip():
            for w in sent:
                if w.is_space:
                    continue
                w_ = represent_word(w)
                if w_:
                    sentence.append(w_)
            strings.append(' '.join(sentence))
    if strings:
        return '\n'.join(strings) + '\n'
    else:
        return ''


def represent_word(word):
    if word.like_url:
        x = url_re.search(word.text.strip().lower())
        if x:
            return x.group(3)+'|URL'
        else:
            return word.text.lower().strip()+'|URL?'
    text = re.sub(r'\s', '_', word.text.strip().lower())
    tag = LABELS.get(word.ent_type_)
    # Dropping PUNCTUATION such as commas and DET like the
    if tag is None and word.pos_ not in ['PUNCT', 'DET']:
        tag = word.pos_
    elif tag is None:
        return None
    # if not word.pos_:
    #    tag = '?'
    return text + '|' + tag

corpus = '''
As far as Saudi Arabia and its motives, that is very simple also. The Saudis are
good at money and arithmetic. Faced with the painful choice of losing money
maintaining current production at US$60 per barrel or taking two million barrels
per day off the market and losing much more money - it's an easy choice: take
the path that is less painful. If there are secondary reasons like hurting US
tight oil producers or hurting Iran and Russia, that's great, but it's really
just about the money.
'''

corpus_stripped = strip_meta(corpus)

doc = nlp(corpus_stripped)
corpus_ = []
for word in doc:
    # Only lemmatize NOUN and PROPN tokens longer than three characters
    # whose lemma differs in length from the surface form.
    if word.pos_ in ['NOUN', 'PROPN'] and len(word.text) > 3 and len(word.text) != len(word.lemma_):
        # Keep the first character of the original token (preserving its case),
        # append the rest of the lemma, then re-add any trailing whitespace.
        lemma_ = str(word.text[:1] + word.lemma_[1:] + word.text_with_ws[len(word.text):])
        corpus_.append(lemma_)
    # All other words are added unchanged.
    else:
        corpus_.append(word.text_with_ws)

result = transform_doc(nlp(''.join(corpus_)))

sense2vec_filename = 'text.txt'
with open(sense2vec_filename, 'w') as file:
    file.write(result)
print(result)
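The script above stops at writing text.txt, so here is how you could actually train vectors on it. A minimal sketch using Gensim's word2vec (my own suggestion, not part of the original sense2vec pipeline; the hyperparameters are illustrative, not the ones used for the published model):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Each line of text.txt is one sentence of 'token|TAG' units,
# so LineSentence can stream the file directly.
sentences = LineSentence(sense2vec_filename)

model = Word2Vec(
    sentences,
    size=128,     # vector dimensionality (illustrative)
    window=5,
    min_count=1,  # tiny toy corpus; raise this for real data
    workers=4,
    sg=1,         # skip-gram; CBOW (sg=0) works too
)
model.save('sense2vec_gensim.model')

# Because the tag is part of the token, you query by word *and* sense:
print(model.most_similar('money|NOUN'))

On a corpus this small the neighbours are noise, of course; the point is only that the token|TAG format round-trips into an ordinary word2vec training run.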

You could visualise your model in TensorBoard via Gensim using this approach: https://github.com/ArdalanM/gensim2tensorboard

I'll also adjust that code to work with the sense2vec approach (e.g. words become lowercase in its preprocessing step; just comment that out in the code).

Happy coding, woltob

answered Oct 28 '22 by woltob