I'm looking to find a way to extract "important phrases" from text documents. Was hoping to do this using Spacy, but there is one caveat: my data contains mostly product information and therefore the important phrases are different from what they would be in natural spoken language. For this reason, I would like to train spacy on my own corpus, but the only info I can find is for training spacy using labeled data.
Does anyone know if what I want to do is possible?
If you are looking for a scheme to weight phrases according to "Importance" without any labeled data, you can try using TF-IDF.
For this answer, I will refer to terms - these can be phrases or words. It just represents a single entity of text.
A Brief Look at TF-IDF
Code Implementation
This has a fit_transform
function that takes raw text as an input and output the appropriate TF-IDF weights for words and/or n-grams.
If you prefer to do your own tokenization with spaCy, or only include doc.noun_chunks
and doc.ents
that satisfy len(span) >= 2
(i.e. phrases), there is a little hack for the TfidfVectorizer
.
To use your own tokenization, do the following:
dummy = lambda x: x
vectorizer = TfidfVectorizer(analyzer=dummy)
tfidf = vectorizer.fit_transform(list_of_tokenized_docs)
This overrides the default tokenization and lets you use your own list of tokens.
From there you can find the terms that have the highest average TF-IDF score across all documents, and consider those as Important. You can try using those as input to the PhraseMatcher: https://spacy.io/usage/rule-based-matching#phrasematcher.
Or you can find some way to use these to automatically label documents. If you can locate them in your documents after determining they are important, you can then add an appropriate label and use that as training data to some training pipeline.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With