I have a list of products that I am trying to classify into categories. They will be described with incomplete sentences like:
"Solid State Drive Housing"
"Hard Drive Cable"
"1TB Hard Drive"
"500GB Hard Drive, Refurbished from Manufacturer"
How can I use Python and NLP to get an output like "Housing, Cable, Drive, Drive", or a tree that describes which word is modifying which? Thank you in advance.
NLP techniques are relatively ill-equipped to deal with this kind of text.
Phrased differently: it is quite possible to build a solution that includes NLP processes to implement the desired classifier, but the added complexity doesn't necessarily pay off in terms of development speed or classifier precision.
If one really insists on using NLP techniques, POS-tagging and its ability to identify nouns is the most obvious idea, but chunking and access to WordNet or other lexical sources are other plausible uses of NLTK.
Instead, an ad-hoc solution based on simple regular expressions and a few heuristics such as those suggested by NoBugs is probably an appropriate approach to the problem. Admittedly, such solutions carry their own risks, over-fitting chief among them.
Running some plain statistical analysis on the complete set (or a very big sample) of the texts to be considered should help guide the selection of a few heuristics and also keep the over-fitting concern in check. I'm quite sure that a relatively small number of rules, associated with a custom dictionary, should be sufficient to produce a classifier with appropriate precision as well as speed/resource performance.
A few ideas along these lines are sketched in the snippet below.
I'm afraid this answer falls short of providing Python/NLTK snippets as a primer towards a solution, but frankly such simple NLTK-based approaches are likely to be disappointing at best. Also, we should have a much bigger sample set of the input text to guide the selection of plausible approaches, including ones based on NLTK or NLP techniques at large.
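To make the dictionary-plus-heuristics route concrete, here is a minimal sketch. The category list is an assumption chosen only to cover the four sample strings; in a real system it would come out of the statistical scan described above.

import re

# Illustrative only: a tiny custom dictionary of category nouns.
# In practice this set would be built from a statistical scan of the real product texts.
CATEGORY_NOUNS = {"housing", "cable", "drive"}

def categorize(text):
    head = text.split(",")[0]                       # drop qualifiers after the first comma
    words = re.findall(r"[A-Za-z]+", head.lower())  # alphabetic runs; '1TB' yields 'tb', which is simply not in the dictionary
    hits = [w for w in words if w in CATEGORY_NOUNS]
    return hits[-1].capitalize() if hits else None  # the rightmost known noun is usually the head

products = [
    "Solid State Drive Housing",
    "Hard Drive Cable",
    "1TB Hard Drive",
    "500GB Hard Drive, Refurbished from Manufacturer",
]
print([categorize(p) for p in products])  # ['Housing', 'Cable', 'Drive', 'Drive']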
pip install spacy
python -m spacy download en_core_web_sm

import spacy

# load the small English model (older spaCy versions used spacy.load('en'))
nlp = spacy.load('en_core_web_sm')

sent = "INCOMPLETE SENTENCE HERE"
doc = nlp(sent)
# keep the token whose dependency label is ROOT, i.e. the head of the phrase
sub_toks = [tok for tok in doc if tok.dep_ == "ROOT"]
Examples:
sent = "Solid State Drive Housing"
doc=nlp(sent)
sub_toks = [tok for tok in doc if (tok.dep_ == "ROOT") ]
output: [Housing]
sent = "Hard Drive Cable"
doc=nlp(sent)
sub_toks = [tok for tok in doc if (tok.dep_ == "ROOT") ]
output: [Cable]
sent = "1TB Hard Drive"
doc=nlp(sent)
sub_toks = [tok for tok in doc if (tok.dep_ == "ROOT") ]
output: [Drive]
sent = "500GB Hard Drive, Refurbished from Manufacturer"
doc=nlp(sent)
sub_toks = [tok for tok in doc if (tok.dep_ == "ROOT") ]
output: [Drive]
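If it helps, the same idea can be applied to the whole product list in one go. This is only a small usage sketch reusing the nlp object from above; the products list is just the question's examples.

products = [
    "Solid State Drive Housing",
    "Hard Drive Cable",
    "1TB Hard Drive",
    "500GB Hard Drive, Refurbished from Manufacturer",
]

# nlp.pipe streams the texts through the pipeline, which is faster than calling nlp() per item
categories = [
    next(tok.text for tok in doc if tok.dep_ == "ROOT")
    for doc in nlp.pipe(products)
]
print(", ".join(categories))  # expected: Housing, Cable, Drive, Drive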
I would create a list of nouns, either manually with all the nouns you're looking for, or by parsing a dictionary such as this one. Filtering out everything but the nouns would at least get you to "State Drive", "Drive Cable", or "Drive", ignoring everything after the first punctuation mark.
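A minimal sketch of that idea, assuming a hypothetical hand-made noun list (the set below is deliberately tiny and purely illustrative):

NOUNS = {"state", "drive", "cable"}   # stand-in for a hand-made or dictionary-derived noun list

def keep_nouns(text):
    head = text.split(",")[0]          # ignore everything after the first comma (a stand-in for "first punctuation mark")
    return [w for w in head.split() if w.lower() in NOUNS]

print(keep_nouns("Solid State Drive Housing"))                        # ['State', 'Drive']
print(keep_nouns("Hard Drive Cable"))                                 # ['Drive', 'Cable']
print(keep_nouns("1TB Hard Drive"))                                   # ['Drive']
print(keep_nouns("500GB Hard Drive, Refurbished from Manufacturer"))  # ['Drive']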