I am looking to parse unstructured product titles like “Canon D1000 4MP Camera 2X Zoom LCD” into structured data like {brand: canon, model number: d1000, lens: 4MP, zoom: 2X, display type: LCD}.
So far I have: - ; : /
Any techniques, libraries, methods, or algorithms would be much appreciated!
EDIT: There is no heuristic for the product titles. A seller can input anything as a title. For example, 'Canon D1000' alone can be the title. Also, this exercise is not only for camera datasets; the title can be for any product.
To structure unstructured data, preparation techniques such as tokenization, part-of-speech tagging, stemming, and lemmatization transform raw text into a form that machines can process.
I agree there is no 100% success method. A possible approach would be to train a custom NER (named entity recognition) model with some manually annotated data. The labels would be BRAND, MODEL, and TYPE. Another common way to pick out model names and brands is to use a dictionary: brands and models are usually non-dictionary words.
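As a rough illustration of the dictionary idea, here is a minimal sketch that keeps alphabetic tokens which do not appear in a word list. The `COMMON_WORDS` set and the `candidate_names` function are hypothetical names made up for this example; a real filter would load a full English dictionary instead of this handful of words.

```python
import re

# Hypothetical, tiny stand-in for a real English word list (an assumption;
# in practice you would load a full dictionary file).
COMMON_WORDS = {"camera", "zoom", "digital", "with", "black", "lens", "new"}

def candidate_names(title, common_words=COMMON_WORDS):
    """Return purely alphabetic tokens that are not in the word list."""
    tokens = re.findall(r"[A-Za-z0-9]+", title)
    return [t for t in tokens if t.isalpha() and t.lower() not in common_words]

print(candidate_names("Canon D1000 4MP Camera 2X Zoom LCD"))
# prints ['Canon', 'LCD']
```

Note that "LCD" survives the filter as well, so abbreviations and spec terms still need their own list or a manual pass. This is only a first cut at candidates, not a brand detector.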
Having developed a commercial analyzer of this kind, I can tell you that there is no easy solution for this problem. But there are multiple shortcuts, especially if your domain is limited to cameras/electronics.
Firstly, you should look at more sites. Many have the product brand annotated in the page (proper HTML annotations, bold font, all caps at the beginning of the name). Some sites have entire pages with brand selectors for search purposes. This way you can create a pretty good starter dictionary of brand names. The same goes for product line names and even for models. Alphanumeric models can be extracted in bulk with regular expressions and filtered pretty quickly.
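The bulk regex extraction could look something like the sketch below. The pattern (a token mixing letters and digits), the `SPEC` list of unit suffixes, and the `model_candidates` function are all assumptions made for illustration; real data would need a longer suffix list and a manual review of the counts.

```python
import re
from collections import Counter

# Assumed pattern: a token containing at least one letter and one digit.
MIXED = re.compile(r"\b(?=[A-Za-z0-9]*[A-Za-z])(?=[A-Za-z0-9]*\d)[A-Za-z0-9]+\b")
# Assumed spec-like tokens to drop (number followed by a unit), e.g. 4MP, 2X.
SPEC = re.compile(r"^\d+(?:MP|X|GB|MM)$", re.IGNORECASE)

def model_candidates(titles):
    """Count mixed letter/digit tokens that do not look like a spec value."""
    counts = Counter()
    for title in titles:
        for token in MIXED.findall(title):
            if not SPEC.match(token):
                counts[token.upper()] += 1
    return counts

titles = ["Canon D1000 4MP Camera 2X Zoom LCD", "Canon D1000"]
print(model_candidates(titles).most_common())
# prints [('D1000', 2)]
```

Sorting by frequency makes the filtering step quick: tokens that recur across many titles are likely real models, and the long tail can be skimmed by hand.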
There are plenty of other tricks, but I'll try to be brief. Just a piece of advice here: there is always a trade-off between manual work and algorithms. Always keep in mind that both approaches can be mixed and both have return-on-invested-time curves, which people tend to forget. If your goal is not to create an automatic algorithm to extract product brands and models, this problem should have a limited time budget in your plan. You can realistically create a dictionary of 1000 brands in a day, and for decent performance on a known data source of electronic goods (we are not talking Amazon here, or are we?) a dictionary of 4000 brands may be all you need for your work. So do the math before you invest weeks into the latest neural network named entity recognizer.
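One way to "do the math" is to measure how many titles a hand-built dictionary already covers before investing in anything heavier. The sketch below assumes a hypothetical `BRANDS` set and made-up helper names (`tag_brand`, `coverage`); it only handles single-word brands.

```python
# Hypothetical starter dictionary; a real one would hold thousands of entries.
BRANDS = {"canon", "nikon", "sony"}

def tag_brand(title, brands=BRANDS):
    """Return the first token found in the brand dictionary, or None."""
    for token in title.split():
        if token.lower() in brands:
            return token.lower()
    return None

def coverage(titles, brands=BRANDS):
    """Fraction of titles in which a known brand was found."""
    if not titles:
        return 0.0
    found = sum(1 for t in titles if tag_brand(t, brands) is not None)
    return found / len(titles)

# The last title is an invented example of a title with no known brand.
sample = ["Canon D1000 4MP Camera 2X Zoom LCD", "Canon D1000", "Unbranded tripod"]
print(tag_brand(sample[0]), round(coverage(sample), 2))
```

If coverage on a sample of your own titles is already high, extending the dictionary by hand is probably the cheaper path. If it stays low after a day of work, that is the point at which a trained recognizer starts to pay off.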