
Unsupervised named entity recognition (NER) with a custom controlled vocabulary for crosslink suggestions in Java

I'm looking for a Java library that can do named entity recognition (NER) against a custom controlled vocabulary, without needing labeled training data first. I searched SE a bit, but most questions are rather unspecific.

Consider the following use-case:

  • an editor is inputting articles in a CMS (about 500 words).
  • the text may contain references (in plain text) to entities of a specific domain, e.g.:
    • names of points of interest, like bars, restaurants, as well as neighborhoods, etc.
  • a controlled vocabulary of these entities exists (about 5,000 entities).
    • I imagine an entity to be a -tuple in the vocabulary
  • after finishing the text, the user should be able to save the document.
  • This triggers a workflow that scans the text against the vocabulary by comparing it with the name of each entity. A 100% match is not required: 97% on Jaro-Winkler or some other measure (I'm not familiar with the algorithms NER uses) may be enough. I need this threshold to be configurable.
  • Hits are returned to the controller server-side. The controller in turn returns JSON containing the matched entities to the client, where they are presented to the editor as suggested crosslinks.
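
The scanning step described above could be sketched roughly as follows. This is a minimal plain-Java illustration, not code from any NER library: the class `CrosslinkSuggester`, its `Hit` type and the `suggest` method are hypothetical names, and the tokenisation (split on anything that is not a letter, digit or apostrophe) is an assumption. It slides a window of as many tokens as the entity name has over the text and keeps every window whose Jaro-Winkler similarity reaches a configurable threshold.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class CrosslinkSuggester {

    /** One suggested crosslink: the vocabulary entry, the text it matched, and the score. */
    static final class Hit {
        final String entity;
        final String matched;
        final double score;
        Hit(String entity, String matched, double score) {
            this.entity = entity;
            this.matched = matched;
            this.score = score;
        }
    }

    /** Jaro similarity in [0, 1]. */
    static double jaro(String a, String b) {
        if (a.equals(b)) return 1.0;
        int la = a.length(), lb = b.length();
        if (la == 0 || lb == 0) return 0.0;
        int window = Math.max(0, Math.max(la, lb) / 2 - 1);
        boolean[] ma = new boolean[la], mb = new boolean[lb];
        int matches = 0;
        for (int i = 0; i < la; i++) {
            int lo = Math.max(0, i - window), hi = Math.min(lb - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!mb[j] && a.charAt(i) == b.charAt(j)) {
                    ma[i] = true;
                    mb[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count matched characters that appear in a different order in the two strings.
        int k = 0, outOfOrder = 0;
        for (int i = 0; i < la; i++) {
            if (!ma[i]) continue;
            while (!mb[k]) k++;
            if (a.charAt(i) != b.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        return (m / la + m / lb + (m - outOfOrder / 2.0) / m) / 3.0;
    }

    /** Jaro-Winkler: Jaro boosted for a shared prefix of up to 4 characters (scaling 0.1). */
    static double jaroWinkler(String a, String b) {
        double j = jaro(a, b);
        int max = Math.min(4, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < max && a.charAt(prefix) == b.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }

    /** Lower-cases and splits on anything that is not a letter, digit or apostrophe. */
    static String[] tokenize(String s) {
        return s.toLowerCase(Locale.ROOT).trim().split("[^\\p{L}\\p{N}']+");
    }

    /** Compares every token window of the text against every vocabulary name. */
    static List<Hit> suggest(String text, List<String> vocabulary, double threshold) {
        String[] tokens = tokenize(text);
        List<Hit> hits = new ArrayList<>();
        for (String entity : vocabulary) {
            String[] nameTokens = tokenize(entity);
            String name = String.join(" ", nameTokens);
            int n = nameTokens.length;
            for (int i = 0; i + n <= tokens.length; i++) {
                String candidate = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
                double score = jaroWinkler(name, candidate);
                if (score >= threshold) hits.add(new Hit(entity, candidate, score));
            }
        }
        return hits;
    }
}
```

For example, `CrosslinkSuggester.suggest(articleText, vocabulary, 0.95)` would return the vocabulary entries whose names appear (nearly) verbatim in the article. With about 500 tokens and 5,000 entities this brute-force loop makes a few million comparisons per save; an index or trie over the vocabulary would be the obvious next step if that turns out to be too slow.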

Ideally, I'm looking for a project to piggyback on that uses NER to suggest crosslinks within a CMS environment. (I'm sure plugins for WordPress exist, for example; I'm not so sure whether something similar exists in Java.)

Any other, more general pointers to NER libraries that work with custom controlled vocabularies are welcome as well.

Geert-Jan asked Oct 05 '11 15:10



1 Answer

For people looking this up in the future:

LingPipe's "Approximate Dictionary-Based Chunking" covers this use case, see: http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html

(URL edited.)
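
To give a feel for the general idea behind approximate dictionary matching, here is a minimal plain-Java sketch. It is not LingPipe's API: the class `ApproxDictionaryMatcher` and its methods are hypothetical names, and it assumes a simple unit-cost edit distance with a maximum allowed distance per entry. It reports each dictionary entry that occurs somewhere in the text within that many insertions, deletions or substitutions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ApproxDictionaryMatcher {

    /**
     * Smallest unit-cost edit distance between the pattern and ANY substring of the text.
     * Standard dynamic programme for approximate substring search: the cost of an empty
     * pattern prefix is 0 at every text position, so a match may start anywhere.
     */
    static int bestDistance(String pattern, String text) {
        int m = pattern.length();
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int i = 0; i <= m; i++) prev[i] = i;   // against the empty text
        int best = prev[m];
        for (int j = 1; j <= text.length(); j++) {
            curr[0] = 0;                             // match may start at position j
            for (int i = 1; i <= m; i++) {
                int cost = pattern.charAt(i - 1) == text.charAt(j - 1) ? 0 : 1;
                curr[i] = Math.min(prev[i - 1] + cost,           // match or substitute
                          Math.min(prev[i] + 1, curr[i - 1] + 1)); // extra or missing char
            }
            best = Math.min(best, curr[m]);
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return best;
    }

    /** Dictionary entries found in the text within maxDistance edits (case-insensitive). */
    static List<String> findEntries(List<String> dictionary, String text, int maxDistance) {
        String haystack = text.toLowerCase(Locale.ROOT);
        List<String> found = new ArrayList<>();
        for (String entry : dictionary) {
            if (bestDistance(entry.toLowerCase(Locale.ROOT), haystack) <= maxDistance) {
                found.add(entry);
            }
        }
        return found;
    }
}
```

A fixed maximum distance is harsh on short names (one edit in a four-letter name is a big change), so in practice the allowed distance would probably need to scale with the length of the entry.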

Geert-Jan answered Sep 30 '22 14:09
