
NLP, spaCy: Strategy for improving document similarity

One-sentence backdrop: I have text data from auto-transcribed talks, and I want to compare the similarity of their content (e.g. what they are talking about) to do clustering and recommendation. I am quite new to NLP.


Data: The data I am using is available here. For all the lazy ones:

git clone https://github.com/TMorville/transcribed_data

and here is a snippet of code to put it in a df:

import os, json
import pandas as pd

from pandas.io.json import json_normalize

def td_to_df():

    path_to_json = '#FILL OUT PATH'
    json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('td.json')]

    # one row per transcribed talk
    tddata = pd.DataFrame(columns=['trans', 'confidence'])

    for index, js in enumerate(json_files):
        with open(os.path.join(path_to_json, js)) as json_file:
            json_text = json_normalize(json.load(json_file))

            # assign the whole row at once; chained indexing like
            # tddata['trans'].loc[index] = ... does not reliably add rows
            tddata.loc[index] = [str(json_text['trans'][0]), str(json_text['confidence'][0])]

    return tddata

Approach: So far, I have only used the spaCy package to do "out of the box" similarity. I simply apply the nlp model to the full text of each talk and compare it to all the others.

import spacy

def similarity_get():

    tddata = td_to_df()

    nlp = spacy.load('en_core_web_lg')

    # use the first transcript as the baseline document
    baseline = nlp(tddata.trans[0])

    for text in tddata.trans:
        print(baseline.similarity(nlp(text)))

Problem: Practically all similarities come out as > 0.95, more or less independently of the baseline. Now, this may not come as a major surprise given the lack of preprocessing.


Solution strategy: Following the advice in this post, I would like to do the following (using spaCy where possible): 1) Remove stop words. 2) Remove most frequent words. 3) Merge word pairs. 4) Possibly use Doc2Vec outside of spaCy.
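
A minimal sketch of what steps 1) and 2) could look like with spaCy is below (the helper name, the frequency cutoff and the re-parsing step are illustrative assumptions, not something settled on yet):

from collections import Counter

import spacy

nlp = spacy.load('en_core_web_lg')

def preprocess(texts, n_most_common=50):
    # parse every transcript once
    docs = [nlp(text) for text in texts]

    # corpus-wide counts over lower-cased, non-stop, non-punctuation tokens
    counts = Counter(token.lower_ for doc in docs for token in doc
                     if not token.is_stop and not token.is_punct)
    too_common = {word for word, _ in counts.most_common(n_most_common)}

    cleaned = []
    for doc in docs:
        kept = [token.lemma_ for token in doc
                if not token.is_stop and not token.is_punct
                and token.lower_ not in too_common]
        # re-parse the reduced text so .similarity() works on the cleaned version
        cleaned.append(nlp(' '.join(kept)))
    return cleaned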


Questions: Does the above seem like a sound strategy? If not, what's missing? If yes, how much of this is already happening under the hood when using the pre-trained model loaded with nlp = spacy.load('en_core_web_lg')?

I can't seem to find the documentation that demonstrates what exactly these models are doing, or how to configure them. A quick Google search yields nothing, and even the (very neat) API documentation does not seem to help. Perhaps I am looking in the wrong place?

asked Jun 07 '18 by tmo

People also ask

Which algorithm is used by spaCy to find similarity between two words in a document?

These word vectors are generated using an algorithm called word2vec, which can be trained using open-source libraries such as Gensim or fastText. Fortunately, spaCy has its own word vectors built in that are ready to be used (only available for certain languages and models).
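
A quick way to check whether a loaded model actually ships with vectors (a small illustrative snippet, assuming en_core_web_lg is installed):

    import spacy

    nlp = spacy.load('en_core_web_lg')
    token = nlp("talk")[0]
    print(token.has_vector)     # True when the word is in the model's vector table
    print(token.vector.shape)   # (300,) for en_core_web_lg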

Does spaCy use cosine similarity?

spaCy uses cosine similarity in the backend to compute .similarity().
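
This can be verified by recomputing the score from the document vectors (a small sketch; the example sentences are made up):

    import numpy as np
    import spacy

    nlp = spacy.load('en_core_web_lg')
    doc1 = nlp("I like machine learning")
    doc2 = nlp("I enjoy deep learning")

    cosine = np.dot(doc1.vector, doc2.vector) / (
        np.linalg.norm(doc1.vector) * np.linalg.norm(doc2.vector))

    print(doc1.similarity(doc2))  # spaCy's built-in score
    print(cosine)                 # matches up to floating-point error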

What is the maximum similarity score in spaCy?

Word similarity is a number between 0 and 1 that tells us how close two words are semantically.

What does NLP () do in spaCy?

When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the trained models typically includes a tagger, a lemmatizer, a parser and an entity recognizer.
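
To see which of these components a loaded model actually runs, you can inspect the pipeline (a quick illustrative check; the exact component names depend on the spaCy version and model):

    import spacy

    nlp = spacy.load('en_core_web_lg')
    print(nlp.pipe_names)
    # e.g. ['tagger', 'parser', 'ner'] for spaCy v2 English models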


1 Answer

You can do most of that with spaCy and some regexes.

So, you should take a look at the spaCy API documentation.

Basic steps in any NLP pipeline are the following:

  1. Language detection (self-explanatory: if you're working with some dataset, you know what the language is and can adapt your pipeline to that). When you know the language, you have to download the correct model from spaCy. The instructions are here. Let's use English for this example. In your command line just type python -m spacy download en and then import it in the preprocessing script like this:

    import spacy
    nlp = spacy.load('en')
    
  2. Tokenization - this is the process of splitting the text into words. It's not enough to just do text.split() (e.g. there's would be treated as a single word, but it's actually two words: there and is). So here we use tokenizers. In spaCy you can do something like:

    nlp_doc = nlp(text)
    

where text is your dataset corpus or a sample from a dataset. You can read more about the document instance here
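
To see the effect, you can iterate over the resulting Doc and print the individual tokens (a quick illustrative check; the sample sentence is made up):

    for token in nlp("there's a talk tomorrow"):
        print(token.text)
    # prints: there, 's, a, talk, tomorrow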

  3. Punctuation removal - a pretty self-explanatory process, building on the previous step. To remove punctuation tokens, just type:

    import re

    # removing punctuation tokens
    text_no_punct = [token.text for token in nlp_doc if not token.is_punct]

    # remove punctuation characters left inside the token strings, e.g. 'bye!' -> 'bye'
    REPLACE_PUNCT = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])")
    text_no_punct = [REPLACE_PUNCT.sub("", tok) for tok in text_no_punct]
    
  4. POS tagging - short for Part-Of-Speech tagging. It is the process of marking up a word in a text as corresponding to a particular part of speech. For example:

    A/DT Part-Of-Speech/NNP Tagger/NNP is/VBZ a/DT piece/NN of/IN
    software/NN that/WDT reads/VBZ text/NN in/IN some/DT
    language/NN and/CC assigns/VBZ parts/NNS of/IN speech/NN to/TO
    each/DT word/NN ,/, such/JJ as/IN noun/NN ,/, verb/NN ,/,
    adjective/NN ,/, etc./FW./.
    

where the uppercase codes after the slash are standard word tags. A list of tags can be found here

In spaCy, this is already done by passing the text to the nlp instance. You can get the tags with:

    for token in nlp_doc:
        print(token.text, token.tag_)

  5. Morphological processing: lemmatization - this is the process of transforming words into a linguistically valid base form, called the lemma:

    nouns → singular nominative form
    verbs → infinitive form
    adjectives → singular, nominative, masculine, indefinitive, positive form
    

In spaCy, it's also already done for you by passing the text to the nlp instance. You can get the lemma of every word with:

    for token in nlp_doc:
        print(token.text, token.lemma_)

  6. Removing stopwords - stopwords are words that do not add any new information or meaning to the sentence and can be omitted. You guessed it, this is also already done for you by the nlp instance. To filter out the stopwords, just type:

    text_without_stopwords = [token.text for token in nlp_doc if not token.is_stop]
    nlp_doc = nlp(' '.join(text_without_stopwords))
    

Now you have a clean dataset. You can use word2vec or GloVe pretrained models to create word vectors and feed your data into some model. Alternatively, you can use TF-IDF to create document vectors, removing or down-weighting the most common words. Also, contrary to the usual process, you may want to keep the most specific words, since your task is to better differentiate between two texts. I hope this is clear enough :)
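
For example, the TF-IDF route could look roughly like this with scikit-learn (scikit-learn and the cleaned_texts variable are assumptions for illustration, not part of the original answer):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # cleaned_texts: a list of preprocessed transcript strings (one per talk)
    vectorizer = TfidfVectorizer(max_df=0.8)   # ignore terms that appear in >80% of documents
    tfidf = vectorizer.fit_transform(cleaned_texts)

    similarities = cosine_similarity(tfidf)    # (n_docs, n_docs) pairwise similarity matrix
    print(similarities[0])                     # similarity of the first talk to all others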

answered Sep 27 '22 by Novak