
Unstructured Text to Structured Data

I am looking for references (tutorials, books, academic literature) on structuring unstructured text, in a manner similar to the Google Calendar "Quick Add" button.

I understand this may come under the NLP category, but I am interested only in the process of going from something like "Levi jeans size 32 A0b293"

to: Brand: Levi, Size: 32, Category: Jeans, Code: A0b293

I imagine it would be some combination of lexical parsing and machine learning techniques.

I am fairly language-agnostic, but if pushed I would prefer Python, MATLAB, or C++ references.

Thanks

zenna asked Jul 01 '10 23:07



2 Answers

You need to provide more information about the source of the text (the web? user input?), the domain (is it just clothes?), the potential formatting and vocabulary...

Assuming the worst-case scenario, you need to start learning NLP. A very good free book is the NLTK book: http://www.nltk.org/book . It is also a very good introduction to Python, and the software is free (for various usages). Be warned: NLP is hard. It doesn't always work. It is not fun at times. The state of the art is nowhere near where you imagine it is.

Assuming a better scenario (your text is semi-structured), a good free tool is pyparsing. There is a book and plenty of examples, and the resulting code is extremely attractive.
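
For the semi-structured case, here is a minimal sketch of the idea in Python. It uses only the standard library (`re` plus dictionary lookups) rather than pyparsing, so it is a stand-in for a real grammar, not an example of the pyparsing API. The `BRANDS` and `CATEGORIES` lexicons, the `parse_item` function, and the product-code pattern are all assumptions made up for illustration.

```python
import re

# Hypothetical lexicons; a real system would load these from a product catalog.
BRANDS = {"levi": "Levi", "wrangler": "Wrangler"}
CATEGORIES = {"jeans": "Jeans", "shirt": "Shirt"}

# Assumed code format: starts with a letter, is at least five
# alphanumeric characters long, and contains at least one digit.
CODE_RE = re.compile(r"^(?=.*\d)[A-Za-z][A-Za-z0-9]{4,}$")


def parse_item(text):
    """Label whitespace-separated tokens using lexicons and simple patterns."""
    record = {}
    tokens = text.split()
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        low = tok.lower()
        if low in BRANDS:
            record["Brand"] = BRANDS[low]
        elif low in CATEGORIES:
            record["Category"] = CATEGORIES[low]
        elif low == "size" and i + 1 < len(tokens) and tokens[i + 1].isdigit():
            # The keyword "size" followed by a number: consume both tokens.
            record["Size"] = int(tokens[i + 1])
            i += 1
        elif CODE_RE.match(tok):
            record["Code"] = tok
        else:
            # Keep anything unrecognized so nothing is silently dropped.
            record.setdefault("Unparsed", []).append(tok)
        i += 1
    return record


print(parse_item("Levi jeans size 32 A0b293"))
```

Hand-written rules like these stop scaling once the vocabulary or the formats vary much, which is roughly the point where a proper grammar or the NLP route becomes worth the effort.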

I hope this helps...

Tal Weiss answered Sep 27 '22 20:09


Possibly look at "Programming Collective Intelligence" by Toby Segaran. I seem to remember it addressing the basics of this in one chapter.

leancz answered Sep 27 '22 21:09
