
Train Spacy on unlabeled text corpus to extract "important phrases"

Tags: python, nlp, spacy

I'm looking for a way to extract "important phrases" from text documents. I was hoping to do this using spaCy, but there is one caveat: my data contains mostly product information, so the important phrases are different from what they would be in natural spoken language. For this reason, I would like to train spaCy on my own corpus, but the only info I can find is for training spaCy using labeled data.

Does anyone know if what I want to do is possible?

Asked Nov 06 '22 by Muriel


1 Answer

If you are looking for a scheme to weight phrases according to "Importance" without any labeled data, you can try using TF-IDF.

For this answer, I will refer to terms - these can be phrases or words; a term just represents a single unit of text.

A Brief Look at TF-IDF


  • TF-IDF stands for (Term Frequency) x (Inverse Document Frequency).
  • It is a measure of how often a term appears in a single document vs. how often that term appears across the entire corpus of documents.
  • It is commonly used as a statistical measure to determine how important terms are in a corpus.
  • For a longer, but readable explanation of it, check out the wiki: https://en.wikipedia.org/wiki/Tf%E2%80%93idf.
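
As a toy illustration of the idea (this uses the plain log form of IDF; scikit-learn's default variant adds smoothing and normalization, so its exact numbers will differ), here is TF-IDF computed by hand on a made-up three-document corpus:

  import math

  # Made-up product-style corpus (hypothetical example data).
  docs = [
      "usb cable fast charging",
      "usb adapter wall plug",
      "fast charging power bank",
  ]
  tokenized = [d.split() for d in docs]

  def tf(term, doc_tokens):
      # Term frequency: share of the document taken up by this term.
      return doc_tokens.count(term) / len(doc_tokens)

  def idf(term, corpus):
      # Inverse document frequency: terms in fewer documents score higher.
      n_containing = sum(1 for doc in corpus if term in doc)
      return math.log(len(corpus) / n_containing)

  # "usb" appears in 2 of 3 docs, "power" in only 1, so "power" weighs more:
  print(tf("usb", tokenized[0]) * idf("usb", tokenized))      # ~0.10
  print(tf("power", tokenized[2]) * idf("power", tokenized))  # ~0.27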

Code Implementation


  • Check out Scikit-Learn's TfidfVectorizer.
    • This has a fit_transform function that takes raw text as input and outputs the appropriate TF-IDF weights for words and/or n-grams.

    • If you prefer to do your own tokenization with spaCy, or only include doc.noun_chunks and doc.ents that satisfy len(span) >= 2 (i.e. phrases), there is a little hack for the TfidfVectorizer (a fuller sketch follows this list).

    • To use your own tokenization, do the following:

      from sklearn.feature_extraction.text import TfidfVectorizer

      # Identity analyzer: each document is already a list of tokens,
      # so the vectorizer should use it as-is.
      dummy = lambda x: x

      vectorizer = TfidfVectorizer(analyzer=dummy)
      tfidf = vectorizer.fit_transform(list_of_tokenized_docs)

      This overrides the default tokenization and lets you pass in your own lists of tokens.
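
Putting the pieces together, here is a sketch of the spaCy variant (the example sentences are made up for illustration, and the en_core_web_sm model must be installed separately):

  import spacy
  from sklearn.feature_extraction.text import TfidfVectorizer

  nlp = spacy.load("en_core_web_sm")

  def phrase_tokenizer(text):
      # Keep only noun chunks and entities of 2+ tokens, i.e. phrases.
      doc = nlp(text)
      spans = list(doc.noun_chunks) + list(doc.ents)
      return [span.text.lower() for span in spans if len(span) >= 2]

  docs = [
      "Stainless steel water bottle with vacuum insulation.",
      "Wireless noise cancelling headphones with long battery life.",
  ]
  tokenized_docs = [phrase_tokenizer(d) for d in docs]

  dummy = lambda x: x
  vectorizer = TfidfVectorizer(analyzer=dummy)
  tfidf = vectorizer.fit_transform(tokenized_docs)
  print(vectorizer.get_feature_names_out())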

From there you can find the terms that have the highest average TF-IDF score across all documents and consider those important. You can try using those as input to the PhraseMatcher: https://spacy.io/usage/rule-based-matching#phrasematcher.
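
For example, continuing from the sketch above (reusing nlp, vectorizer, and tfidf; the IMPORTANT label name and the cutoff of 20 terms are arbitrary choices):

  import numpy as np
  from spacy.matcher import PhraseMatcher

  # Average TF-IDF weight of each term across all documents.
  mean_tfidf = np.asarray(tfidf.mean(axis=0)).ravel()
  terms = vectorizer.get_feature_names_out()
  top_terms = [terms[i] for i in mean_tfidf.argsort()[::-1][:20]]

  # Match those top-scoring phrases in new text, ignoring case.
  matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
  matcher.add("IMPORTANT", [nlp.make_doc(t) for t in top_terms])

  doc = nlp("A stainless steel water bottle that keeps drinks cold.")
  for match_id, start, end in matcher(doc):
      print(doc[start:end].text)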

Alternatively, you can use these terms to automatically label documents. If you can locate them in your documents after determining they are important, you can add an appropriate label and use the result as training data for a training pipeline.
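
One rough way to do that auto-labeling, again continuing the sketch above: run the matcher over each document and record character offsets in the (text, {"entities": [...]}) format commonly used for spaCy NER training examples (the IMPORTANT label remains an arbitrary choice here):

  from spacy.util import filter_spans

  train_data = []
  for text in docs:
      doc = nlp(text)
      spans = [doc[start:end] for _, start, end in matcher(doc)]
      spans = filter_spans(spans)  # training data cannot contain overlapping spans
      entities = [(s.start_char, s.end_char, "IMPORTANT") for s in spans]
      train_data.append((text, {"entities": entities}))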

Answered Nov 12 '22 by Branden Ciranni