
How to parse product titles (unstructured) into structured data?

I am looking to parse unstructured product titles like “Canon D1000 4MP Camera 2X Zoom LCD” into structured data like {brand: canon, model number: d1000, lens: 4MP, zoom: 2X, display type: LCD}.

So far I have:

  1. Removed stopwords and cleaned up the text (removed characters like - ; : /)
  2. Tokenized long strings into words.
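
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the exact code in use: the `STOPWORDS` set and the function name `clean_and_tokenize` are assumptions made for the example.

```python
import re
from typing import List

# Hypothetical stopword list; a real one would be larger and domain-specific.
STOPWORDS = {"a", "an", "the", "with", "for", "and", "of"}

def clean_and_tokenize(title: str) -> List[str]:
    """Replace separator characters with spaces, lowercase, split, drop stopwords."""
    cleaned = re.sub(r"[-;:/]", " ", title)  # the characters listed in step 1
    tokens = cleaned.lower().split()         # whitespace tokenization (step 2)
    return [t for t in tokens if t not in STOPWORDS]

print(clean_and_tokenize("Canon D1000 4MP Camera 2X Zoom LCD"))
```

One caveat with this kind of cleanup: replacing hyphens also splits hyphenated model numbers into separate tokens.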

Any techniques, libraries, methods, or algorithms would be much appreciated!

EDIT: There is no fixed format for the product titles; a seller can input anything as a title. For example, 'Canon D1000' alone can be the whole title. Also, this exercise is not limited to camera datasets; the title can be for any product.

stealthspy asked Aug 28 '13 19:08





2 Answers

I agree there is no 100% success method. A possible approach would be to train a custom NER (Named Entity Recognition) model on some manually annotated data. The labels would be BRAND, MODEL, and TYPE. Another common way to pick out brands and model names is to use a dictionary: brands and models are usually non-dictionary words.
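
As a rough illustration of the dictionary idea, the sketch below keeps only the tokens that are missing from a word list. The `COMMON_WORDS` set is a hypothetical stand-in for a real English dictionary, and the function name is made up for this example.

```python
# Hypothetical tiny vocabulary standing in for a real English word list.
COMMON_WORDS = {"camera", "zoom", "digital", "lens", "black", "with"}

def non_dictionary_tokens(title):
    """Return the tokens that are not in the vocabulary, in their original order."""
    return [t for t in title.split() if t.lower() not in COMMON_WORDS]

print(non_dictionary_tokens("Canon D1000 4MP Camera 2X Zoom LCD"))
```

Note that spec tokens such as 4MP and 2X also survive this filter, and that with a full dictionary a brand like Canon would be dropped because "canon" is an ordinary English word. The filter narrows the candidates rather than labeling them.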

bogs answered Jan 03 '23 17:01



Having developed a commercial analyzer of this kind, I can tell you that there is no easy solution to this problem. But there are multiple shortcuts, especially if your domain is limited to cameras/electronics.

Firstly, you should look at more sites. Many have the product brand annotated in the page (proper HTML annotations, bold font, all caps at the beginning of the name). Some sites have entire pages with brand selectors for search purposes. This way you can create a pretty good starter dictionary of brand names. The same goes for product line names and even models. Alphanumeric models can be extracted in bulk with regular expressions and filtered pretty quickly.
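
To make the dictionary-plus-regex idea concrete, here is a minimal sketch. The brand set, both patterns, and the `extract` function are assumptions made for illustration, not the analyzer described in this answer.

```python
import re

# Hypothetical starter dictionary, e.g. collected from brand selector pages.
KNOWN_BRANDS = {"canon", "nikon", "sony"}

# A run of letters and digits containing at least one letter and one digit, e.g. "D1000".
ALNUM_TOKEN = re.compile(r"\b(?=[A-Za-z0-9]*[A-Za-z])(?=[A-Za-z0-9]*\d)[A-Za-z0-9]+\b")

# Hypothetical filter for spec-like tokens such as "4MP" or "2X".
SPEC_TOKEN = re.compile(r"^\d+(?:MP|X|GB|MM)$", re.IGNORECASE)

def extract(title):
    """Look up the brand in the dictionary and collect alphanumeric model candidates."""
    brand = next((w for w in title.split() if w.lower() in KNOWN_BRANDS), None)
    models = [m for m in ALNUM_TOKEN.findall(title) if not SPEC_TOKEN.match(m)]
    return {"brand": brand, "model_candidates": models}

print(extract("Canon D1000 4MP Camera 2X Zoom LCD"))
```

The spec filter is the "filtered pretty quickly" part: the regex deliberately over-matches, and a short list of unit suffixes removes the most common false positives.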

There are plenty of other tricks, but I'll try to be brief. Just a piece of advice here: there is always a trade-off between manual work and algorithms. Always keep in mind that both approaches can be mixed and both have return-on-invested-time curves, which people tend to forget. If your goal is not to create an automatic algorithm to extract product brands and models, this problem should have a limited time budget in your plan. You can realistically create a dictionary of 1000 brands in a day, and for decent performance on a known data source of electronic goods (we are not talking Amazon here, or are we?), a dictionary of 4000 brands may be all you need for your work. So do the math before you invest weeks into the latest neural network named entity recognizer.

Alex Nevidomsky answered Jan 03 '23 18:01
