 

Extracting information from unstructured text

Tags: nlp, nltk

I have a collection of "articles", each 1 to 10 sentences long, written in noisy, informal English (i.e. social media style). I need to extract some information from each article, where available, such as date and time. I also need to understand what the article is talking about and who the main "actor" is.

For example, given the sentence "Everybody's presence is required tomorrow morning starting from 10.30 to discuss the company's financial forecast.", I need to extract:

  • the date/time => "10.30 tomorrow morning".
  • the topic => "company's financial forecast".
  • the actor => "Everybody".

As far as I know, the date and time could be extracted without using NLP techniques, but I haven't found anything in Python as good as Natty (http://natty.joestelmach.com/).
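On the Python side, one library worth trying is parsedatetime, which, like Natty, understands relative expressions such as "tomorrow morning". A minimal sketch, assuming the library is installed via pip; how well it copes with full noisy sentences is something to evaluate on your own data:

    import parsedatetime

    cal = parsedatetime.Calendar()
    text = "Everybody's presence is required tomorrow morning starting from 10.30"

    # parse() returns a time.struct_time plus a status flag:
    # 0 = nothing found, 1 = date, 2 = time, 3 = date and time.
    time_struct, status = cal.parse(text)
    if status:
        print(time_struct[:5])  # (year, month, day, hour, minute)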

My understanding of how to proceed, after reading some chapters of the NLTK book and watching some videos from the NLP courses on Coursera, is the following:

  1. Use part of the data to create an annotated corpus. I can't use an off-the-shelf corpus because of the informal nature of the text (e.g. spelling errors, uninformative capitalization, word abbreviations, etc.).
  2. Manually (sigh...) annotate each article with tags from the Penn Treebank tagset. Is there any way to automate this step and just check/fix the results?
  3. Train a POS tagger on the annotated articles. The NLTK-trainer project seems promising (http://nltk-trainer.readthedocs.org/en/latest/train_tagger.html). A minimal sketch of steps 3 and 4 follows this list.
  4. Chunking/chinking, which means I'll have to manually annotate the corpus again (...) using IOB notation. Unfortunately, according to this bug report, n-gram chunkers are broken: https://github.com/nltk/nltk/issues/367. This seems like a major issue, and it makes me wonder whether I should keep using NLTK, given that the report is more than a year old.
  5. At this point, if I have done everything correctly, I assume I'll find the actor, topic and datetime in the chunks. Correct?
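As a minimal sketch of steps 3 and 4 in plain NLTK: train a simple tagger on your annotated sentences, then use the rule-based RegexpParser for chunking, which sidesteps the n-gram chunker bug entirely. The toy corpus and grammar below are illustrative stand-ins for your real annotated data:

    import nltk  # word_tokenize needs the punkt model: nltk.download('punkt')

    # Toy stand-in for your manually annotated corpus (step 2).
    tagged_sents = [
        [("Everybody", "NN"), ("'s", "POS"), ("presence", "NN"),
         ("is", "VBZ"), ("required", "VBN"), ("tomorrow", "NN"),
         ("morning", "NN")],
    ]

    # Step 3: unigram tagger with a default-tag backoff.
    tagger = nltk.UnigramTagger(tagged_sents, backoff=nltk.DefaultTagger("NN"))

    # Step 4: a rule-based chunker instead of a (broken) n-gram chunker.
    grammar = r"NP: {<DT|PRP\$>?<JJ>*<NN.*>+}"  # actor / topic candidates
    chunker = nltk.RegexpParser(grammar)

    tokens = nltk.word_tokenize("Everybody must discuss the company's financial forecast")
    print(chunker.parse(tagger.tag(tokens)))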

Could I (temporarily) skip steps 1, 2 and 3 and produce a working implementation, even with a possibly high error rate? Which corpus should I use?

I was also thinking of a pre-processing step to correct common spelling mistakes and shortcuts like "yess", "c u" and other abominations. Is there anything already existing I can take advantage of?

THE question, in a nutshell, is: is my approach to solving this problem correct? If not, what am I doing wrong?

Trasplazio Garzuglio asked Mar 25 '14


1 Answer

Could I (temporarily) skip steps 1, 2 and 3 and produce a working implementation, even with a possibly high error rate? Which corpus should I use?

I was also thinking of a pre-processing step to correct common spelling mistakes and shortcuts like "yess", "c u" and other abominations. Is there anything already existing I can take advantage of?

I would suggest you first have a go at processing standard-language text. The pre-processing you refer to is an NLP task in its own right, known as normalization. Here is a resource for Twitter normalization: http://www.ark.cs.cmu.edu/TweetNLP/. In addition, you can apply spell checking, sentence boundary detection, and so on.
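For the spelling shortcuts specifically, even a small lookup table gets you surprisingly far before you invest in a full normalizer. A minimal sketch; the table below is an illustrative toy, not an existing resource:

    import re

    # Illustrative entries; in practice, build the table from your own data
    # or from a published normalization lexicon.
    LEXICON = {"yess": "yes", "c": "see", "u": "you", "2moro": "tomorrow"}

    def normalize(text):
        # Split into word and punctuation tokens, then rewrite known shortcuts.
        tokens = re.findall(r"\w+|\S", text.lower())
        return " ".join(LEXICON.get(tok, tok) for tok in tokens)

    print(normalize("yess, c u 2moro"))  # -> yes , see you tomorrow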

THE question, in a nutshell, is: is my approach to solving this problem correct? If not, what am I doing wrong?

Setting normalization aside, I think your approach is valid. With regard to automating the annotation process: you can bootstrap it by using off-the-shelf components first, then correct, retrain, and so on, over several iterations. To get acceptable results, you will need to repeat your steps 2, 3, and 4 a couple of times.
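A sketch of what one bootstrap iteration could look like with NLTK: pre-annotate with the off-the-shelf tagger, write the result out in word/TAG format for manual correction, and retrain on the corrected files. The file name is illustrative, and nltk.pos_tag requires NLTK's bundled tagger models:

    import nltk

    articles = ["Everybody's presence is required tomorrow morning."]

    with open("to_correct.txt", "w") as out:
        for article in articles:
            # First pass with the off-the-shelf tagger; fix its mistakes by hand.
            tagged = nltk.pos_tag(nltk.word_tokenize(article))
            out.write(" ".join("%s/%s" % (w, t) for w, t in tagged) + "\n")

    # After manual correction, read the annotations back for retraining:
    corrected = [[nltk.tag.str2tuple(t) for t in line.split()]
                 for line in open("to_correct.txt")]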

If you are interested in understanding the problem and being able to optimize existing solutions, I would suggest you focus on tools that let you develop your own models. If you prioritize getting results over developing your own models, I would recommend looking into existing open-source text engineering frameworks such as Gate (https://gate.ac.uk/), UIMA (http://uima.apache.org/), and DKPro, which extends UIMA (https://code.google.com/p/dkpro-core-asl/). All three frameworks wrap existing components, so you have a wide range of possible solutions.

jvdbogae answered Nov 17 '22