Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How apache UIMA is different from Apache Opennlp

Tags:

nlp

opennlp

uima

I have been doing some capability testing with Apache OpenNLP, Which has the capability to Sentence detection, Tokenization, Name entity recognition. Now when i started looking at UIMA documents it is mentioned on the UIMA home page - "language identification" => "language specific segmentation" => "sentence boundary detection" => "entity detection (person/place names etc.)".

Which says that i can use UIMA to do the same task as done by OpenNLP. What added feature both have ? I am new to this area, Please help me to understand the uses and capability perspective of both.

like image 939
vashishth Avatar asked May 19 '15 06:05

vashishth


People also ask

Is Apache open NLP free?

That means you can run all the lessons at course.fast.ai, and do your own research and development with fastai, all for free.


1 Answers

As I understand the question, you are asking for the differences between the feature sets of Apache UIMA and Apache OpenNLP. Their feature sets barely have anything in common as these two projects have very different aims.

Apache UIMA is an open source implementation of the UIMA specification. The latter defines a conceptual framework for augmenting unstructured information (such as natural language produced by humans) with structured metadata so that computers can work with it.

As an example for an application working with unstructured information, let us take a an application that takes natural language text as input and marks all named entities in the given text, e.g.

Input text = "Bob's cat Charlie is chasing a mouse."
Result = "<NE>Bob</NE>'s cat <NE>Charlie</NE> is chasig a mouse."

To identify the named entities in this example (i.e. Bob and Charlie), several steps of natural language processing have to be performed. Without going into detail about what each of the steps does, a hypothetical system for named entity recognition might involve the following steps:

  1. Data preparation
  2. Sentence splitting
  3. Tokenization
  4. Token lemmatization
  5. Part-of-speech tagging
  6. Phrase detection
  7. Classifying phrases as named entities or not

As you can see, such applications can be very intuitively modelled as sequences of components, and this is exactly what UIMA does. It models applications dealing with unstructed information as pipelines of components (called analytics in UIMA parlance). As you can imagine, many of the pipeline components listed above can be used for other tasks and so the architecture design of UIMA emphasizes reusability of components.

To avoid confusion, the UIMA standard itself doesn't provide any specific components, but defines an infrastructure for UIM (Unstructured Information Management) applications, e.g. workflows, data types, inter-component communication, and so on.

Apache OpenNLP on the other hand does exactly that, namely provide concrete implementations of NLP algorithms dealing with very specific tasks (sentence splitting, POS-tagging, etc.). The source of your confusion might be that it is possible to write Apache UIMA components that wrap OpenNLP tools. The OpenNLP project actually provides such components.

Whether you want to use the UIMA framework for your UIM applications depends on the size of the project. If it is small, I would go without UIMA and just use OpenNLP directly, as UIMA is rather heavy-weight and thus only adds complex yet (for small applications) unnecessary overhead. Also, due to its complexity, it takes a good amount of time to learn how to use it.

Summing up, Apache UIMA and Apache OpenNLP solve different problems, but since both deal with unstructured information, they can be combined profitably.

like image 75
zepp133 Avatar answered Sep 30 '22 16:09

zepp133