Train spaCy on unlabeled text corpus to extract "important phrases"

Question

I'm looking for a way to extract "important phrases" from text documents. I was hoping to do this with spaCy, but there is one caveat: my data consists mostly of product information, so the important phrases differ from what they would be in natural spoken language. For this reason, I would like to train spaCy on my own corpus, but the only information I can find covers training spaCy on labeled data.

Does anyone know if what I want to do is possible?

Branden Ciranni · Accepted Answer

If you are looking for a scheme to weight phrases by "importance" without any labeled data, you can try TF-IDF.

In this answer, a "term" can be a word or a phrase: it is simply a single unit of text.

A Brief Look at TF-IDF

TF-IDF stands for (Term Frequency) x (Inverse Document Frequency).
It weighs how often a term appears in a single document against how many documents in the entire corpus contain that term.
It is commonly used as a statistical measure of how important a term is to a document within a corpus.
For a longer, but readable explanation of it, check out the wiki: https://en.wikipedia.org/wiki/Tf%E2%80%93idf.
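To make the definition concrete, here is a minimal sketch that computes TF-IDF by hand for a tiny corpus. It assumes the textbook variant (tf = count / document length, idf = log(N / df)); the function name `tf_idf` and the sample documents are made up for illustration. Scikit-learn's defaults use a smoothed IDF and L2 normalization, so its numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    Textbook variant: tf = count / len(doc), idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical, already tokenized product descriptions.
docs = [
    ["usb", "cable", "black"],
    ["usb", "charger", "white"],
    ["usb", "cable", "white"],
]
scores = tf_idf(docs)
```

In this toy corpus "usb" occurs in every document, so its score is zero, while "black" occurs in only one document and gets the highest score in the first document.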

Code Implementation

Check out Scikit-Learn's TfidfVectorizer.
- Its fit_transform method takes raw text documents as input and outputs the TF-IDF weights for words and/or n-grams.
- If you prefer to do your own tokenization with spaCy, or only include doc.noun_chunks and doc.ents that satisfy len(span) >= 2 (i.e. phrases), there is a little hack for the TfidfVectorizer.
- To use your own tokenization, do the following:
```
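from sklearn.feature_extraction.text import TfidfVectorizer

# Identity analyzer: each document is already a list of terms,
# so hand it back unchanged instead of letting scikit-learn tokenize it.
dummy = lambda x: x

vectorizer = TfidfVectorizer(analyzer=dummy)
# list_of_tokenized_docs: one list of strings (words or phrases) per document
tfidf = vectorizer.fit_transform(list_of_tokenized_docs)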
```
  This bypasses the vectorizer's built-in preprocessing and tokenization, so each document is used exactly as the list of tokens you supply.

From there you can find the terms with the highest average TF-IDF score across all documents and treat those as important. You can try using those as input to the PhraseMatcher: https://spacy.io/usage/rule-based-matching#phrasematcher.

Alternatively, you can use these terms to label documents automatically. Once you have determined which terms are important, you can locate them in your documents, attach an appropriate label to each occurrence, and use the result as training data for a training pipeline.
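One possible way to turn the important terms into automatic labels is sketched below, assuming spaCy v3. It uses a blank English pipeline so no trained model is needed; the label name `IMPORTANT_PHRASE`, the `auto_label` helper, and the term list are hypothetical.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # tokenizer only; no trained model required

# Hypothetical output of the TF-IDF ranking step.
important_terms = ["water bottle", "stainless steel"]

# attr="LOWER" makes the matching case-insensitive.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("IMPORTANT_PHRASE", [nlp.make_doc(term) for term in important_terms])


def auto_label(text):
    """Return (text, annotations) with character offsets for each match."""
    doc = nlp.make_doc(text)
    entities = []
    for _, start, end in matcher(doc):
        span = doc[start:end]
        entities.append((span.start_char, span.end_char, "IMPORTANT_PHRASE"))
    return text, {"entities": entities}


example = auto_label("Stainless Steel water bottle with bamboo lid")
```

If the term list contains phrases that can overlap in the text, the matches would need to be de-duplicated (for instance with spacy.util.filter_spans) before they are used as entity annotations.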
