
NLP, spaCy: Strategy for improving document similarity

One-sentence backdrop: I have text data from auto-transcribed talks, and I want to compare the similarity of their content (e.g. what they are talking about) to do clustering and recommendation. I am quite new to NLP.


Data: The data I am using is available here. For all the lazy ones:

git clone https://github.com/TMorville/transcribed_data

and here is a snippet of code to put it in a df:

import os, json
import pandas as pd

from pandas.io.json import json_normalize

def td_to_df():

    path_to_json = '#FILL OUT PATH'
    json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('td.json')]

    # one row per transcribed talk
    tddata = pd.DataFrame(columns=['trans', 'confidence'])

    for index, js in enumerate(json_files):
        with open(os.path.join(path_to_json, js)) as json_file:
            json_text = json_normalize(json.load(json_file))

            # assign the whole row at once; chained indexing like
            # tddata['trans'].loc[index] = ... does not reliably add rows
            tddata.loc[index] = [str(json_text['trans'][0]), str(json_text['confidence'][0])]

    return tddata

Approach: So far, I have only used the spaCy package to do "out of the box" similarity. I simply apply the nlp model to the full text of each talk and compare it to all the others.

import spacy

def similarity_get():

    tddata = td_to_df()

    nlp = spacy.load('en_core_web_lg')

    # use the first transcript as the baseline document
    baseline = nlp(tddata.trans[0])

    for text in tddata.trans:
        print(baseline.similarity(nlp(text)))

Problem: Practically all similarities come out as > 0.95, more or less independently of the baseline. Now, this may not come as a major surprise given the lack of preprocessing.


Solution strategy: Following the advice in this post, I would like to do the following (using spaCy where possible): 1) Remove stop words. 2) Remove most frequent words. 3) Merge word pairs. 4) Possibly use Doc2Vec outside of spaCy.
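
A minimal sketch of what steps 1) and 2) could look like with spaCy is below (the helper name, the frequency cutoff and the re-parsing step are illustrative assumptions, not something settled on yet):

from collections import Counter

import spacy

nlp = spacy.load('en_core_web_lg')

def preprocess(texts, n_most_common=50):
    # parse every transcript once
    docs = [nlp(text) for text in texts]

    # corpus-wide counts over lower-cased, non-stop, non-punctuation tokens
    counts = Counter(token.lower_ for doc in docs for token in doc
                     if not token.is_stop and not token.is_punct)
    too_common = {word for word, _ in counts.most_common(n_most_common)}

    cleaned = []
    for doc in docs:
        kept = [token.lemma_ for token in doc
                if not token.is_stop and not token.is_punct
                and token.lower_ not in too_common]
        # re-parse the reduced text so .similarity() works on the cleaned version
        cleaned.append(nlp(' '.join(kept)))
    return cleaned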


Questions: Does the above seem like a sound strategy? If not, what's missing? If yes, how much of this is already happening under the hood when using the pre-trained model loaded with nlp = spacy.load('en_core_web_lg')?

I can't seem to find the documentation that demonstrates what exactly these models are doing, or how to configure them. A quick Google search yields nothing, and even the (very neat) API documentation does not seem to help. Perhaps I am looking in the wrong place?

asked Jun 07 '18 by tmo

People also ask

Which algorithm is used by spaCy to find similarity between two words in a document?

These word vectors are generated using an algorithm called word2vec, which can be trained using open-source libraries such as Gensim or fastText. Fortunately, spaCy has its own word vectors built in that are ready to be used (only available for certain languages and models).
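
A quick way to check whether a loaded model actually ships with vectors (a small illustrative snippet, assuming en_core_web_lg is installed):

    import spacy

    nlp = spacy.load('en_core_web_lg')
    token = nlp("talk")[0]
    print(token.has_vector)     # True when the word is in the model's vector table
    print(token.vector.shape)   # (300,) for en_core_web_lg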

Does spaCy use cosine similarity?

spaCy uses cosine similarity in the backend to compute .similarity().
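
This can be verified by recomputing the score from the document vectors (a small sketch; the example sentences are made up):

    import numpy as np
    import spacy

    nlp = spacy.load('en_core_web_lg')
    doc1 = nlp("I like machine learning")
    doc2 = nlp("I enjoy deep learning")

    cosine = np.dot(doc1.vector, doc2.vector) / (
        np.linalg.norm(doc1.vector) * np.linalg.norm(doc2.vector))

    print(doc1.similarity(doc2))  # spaCy's built-in score
    print(cosine)                 # matches up to floating-point error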

What is the maximum similarity score in spaCy?

Word similarity is a number between 0 and 1 that tells us how close two words are semantically.

What does NLP () do in spaCy?

When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the trained models typically includes a tagger, a lemmatizer, a parser and an entity recognizer.
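
To see which of these components a loaded model actually runs, you can inspect the pipeline (a quick illustrative check; the exact component names depend on the spaCy version and model):

    import spacy

    nlp = spacy.load('en_core_web_lg')
    print(nlp.pipe_names)
    # e.g. ['tagger', 'parser', 'ner'] for spaCy v2 English models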


1 Answer

You can do most of that with spaCy and some regexes.

So, you should take a look at the spaCy API documentation.

Basic steps in any NLP pipeline are the following:

  1. Language detection (self-explanatory: if you're working with some dataset, you know what the language is and can adapt your pipeline to that). When you know the language, you have to download the correct model from spaCy. The instructions are here. Let's use English for this example. In your command line just type python -m spacy download en and then import it in the preprocessing script like this:

    import spacy
    nlp = spacy.load('en')
    
  2. Tokenization - this is the process of splitting the text into words. It's not enough to just do text.split() (e.g. there's would be treated as a single word, but it's actually two words: there and is). So here we use tokenizers. In spaCy you can do something like:

    nlp_doc = nlp(text)
    

where text is your dataset corpus or a sample from a dataset. You can read more about the document instance here
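
To see the effect, you can iterate over the resulting Doc and print the individual tokens (a quick illustrative check; the sample sentence is made up):

    for token in nlp("there's a talk tomorrow"):
        print(token.text)
    # prints: there, 's, a, talk, tomorrow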

  3. Punctuation removal - a pretty self-explanatory process, building on the previous step. To remove punctuation tokens, just type:

    import re

    # removing punctuation tokens
    text_no_punct = [token.text for token in nlp_doc if not token.is_punct]

    # remove punctuation characters left inside the token strings, e.g. 'bye!' -> 'bye'
    REPLACE_PUNCT = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])")
    text_no_punct = [REPLACE_PUNCT.sub("", tok) for tok in text_no_punct]
    
  4. POS tagging - short for Part-Of-Speech tagging. It is the process of marking up a word in a text as corresponding to a particular part of speech. For example:

    A/DT Part-Of-Speech/NNP Tagger/NNP is/VBZ a/DT piece/NN of/IN
    software/NN that/WDT reads/VBZ text/NN in/IN some/DT
    language/NN and/CC assigns/VBZ parts/NNS of/IN speech/NN to/TO
    each/DT word/NN ,/, such/JJ as/IN noun/NN ,/, verb/NN ,/,
    adjective/NN ,/, etc./FW./.
    

where the uppercase codes after the slash are standard word tags. A list of tags can be found here

In spaCy, this is already done by passing the text to the nlp instance. You can get the tags with:

    for token in nlp_doc:
        print(token.text, token.tag_)

  5. Morphological processing: lemmatization - this is the process of transforming words into a linguistically valid base form, called the lemma:

    nouns → singular nominative form
    verbs → infinitive form
    adjectives → singular, nominative, masculine, indefinitive, positive form
    

In spaCy, it's also already done for you by passing the text to the nlp instance. You can get the lemma of every word with:

    for token in nlp_doc:
        print(token.text, token.lemma_)

  6. Removing stopwords - stopwords are words that do not add any new information or meaning to the sentence and can be omitted. You guessed it, this is also already done for you by the nlp instance. To filter out the stopwords, just type:

    text_without_stopwords = [token.text for token in nlp_doc if not token.is_stop]
    nlp_doc = nlp(' '.join(text_without_stopwords))
    

Now you have a clean dataset. You can use word2vec or GloVe pretrained models to create word vectors and feed your data into some model. Alternatively, you can use TF-IDF to create document vectors, removing or down-weighting the most common words. Also, contrary to the usual process, you may want to keep the most specific words, since your task is to better differentiate between two texts. I hope this is clear enough :)
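
For example, the TF-IDF route could look roughly like this with scikit-learn (scikit-learn and the cleaned_texts variable are assumptions for illustration, not part of the original answer):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # cleaned_texts: a list of preprocessed transcript strings (one per talk)
    vectorizer = TfidfVectorizer(max_df=0.8)   # ignore terms that appear in >80% of documents
    tfidf = vectorizer.fit_transform(cleaned_texts)

    similarities = cosine_similarity(tfidf)    # (n_docs, n_docs) pairwise similarity matrix
    print(similarities[0])                     # similarity of the first talk to all others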

answered Sep 27 '22 by Novak