
Entity Extraction/Recognition with free tools while feeding Lucene Index

I'm currently investigating options for extracting person names, locations, tech words and categories from text (a lot of articles from the web), which will then be fed into a Lucene/ElasticSearch index. The additional information is added as metadata and should increase the precision of the search.

E.g. when someone queries 'wicket', he should be able to decide whether he means the cricket sport or the Apache project. I tried to implement this on my own, with only minor success so far. I have now found a lot of tools, but I'm not sure whether they are suited for this task, which of them integrate well with Lucene, or whether the precision of their entity extraction is high enough.

  • Dbpedia Spotlight, the demo looks very promising
  • OpenNLP requires training. Which training data to use?
  • OpenNLP tools
  • Stanbol
  • NLTK
  • balie
  • UIMA
  • GATE -> example code
  • Apache Mahout
  • Stanford CRF-NER
  • maui-indexer
  • Mallet
  • Illinois Named Entity Tagger (not open source, but free)
  • wikipedianer data

My questions:

  • Does anyone have experience with any of the tools listed above and their precision/recall? Is training data required, and if so, is it available?
  • Are there articles or tutorials that would help me get started with entity extraction (NER) for each of these tools?
  • How can they be integrated with Lucene?

Here are some questions related to that subject:

  • Does an algorithm exist to help detect the "primary topic" of an English sentence?
  • Named Entity Recognition Libraries for Java
  • Named entity recognition with Java
Karussell asked Sep 17 '11 13:09


2 Answers

The problem you are facing in the 'wicket' example is called entity disambiguation, not entity extraction/recognition (NER). NER can be useful, but only when its categories are specific enough. Most NER systems don't have enough granularity to distinguish between a sport and a software project (both would fall outside the typically recognized types: person, org, location).
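
To make the distinction concrete, here is a toy sketch of disambiguation by context overlap: each candidate meaning of a term carries a set of context words, and the candidate sharing the most words with the surrounding text wins. The knowledge base, its entries and the function name `disambiguate` are all made up for illustration; a real system would draw candidates and context from a large resource and use far better scoring.

```python
import re

# Hypothetical, hand-written knowledge base (an assumption for this sketch).
KNOWLEDGE_BASE = {
    "wicket": {
        "Wicket (cricket)": {"cricket", "bat", "bowler", "stumps", "innings", "match"},
        "Apache Wicket": {"java", "apache", "web", "framework", "component", "servlet"},
    }
}


def disambiguate(term, text):
    """Return the candidate meaning whose context words overlap most with
    the text, or None if the term is unknown or nothing overlaps."""
    candidates = KNOWLEDGE_BASE.get(term.lower())
    if not candidates:
        return None
    words = set(re.findall(r"[a-z]+", text.lower()))
    best, best_score = None, 0
    for name, context in candidates.items():
        score = len(words & context)  # number of shared context words
        if score > best_score:
            best, best_score = name, score
    return best


if __name__ == "__main__":
    print(disambiguate("wicket", "Wicket is a Java web framework from Apache"))
    print(disambiguate("wicket", "The bowler hit the wicket in the second innings"))
```

A plain NER tagger would at best label both mentions with the same coarse type, which is why the choice between meanings needs this extra step.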

For disambiguation, you need a knowledge base against which entities are disambiguated. DBpedia is a typical choice due to its broad coverage. See my answer to How to use DBPedia to extract Tags/Keywords from content?, where I provide more explanation and mention several tools for disambiguation, including:

  • Zemanta
  • Maui-indexer
  • Dbpedia Spotlight
  • Extractiv (my company)

These tools often expose a language-independent API such as REST, and I am not aware that any of them provides Lucene support directly, but I hope my answer is helpful for the problem you are trying to solve.
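
Since the tools return their annotations over REST rather than writing to the index themselves, the glue code is yours to write. Below is a minimal sketch, not a definitive implementation, of turning one annotation response into extra metadata fields on the document you index. The JSON keys (`Resources`, `@URI`, `@types`, `@similarityScore`) are modelled on what DBpedia Spotlight returns and should be checked against the service you actually call; the field names `entity_uris` and `entity_types`, the helper names and the `min_score` threshold are assumptions of this sketch.

```python
import json


def entity_fields(annotation_json, min_score=0.5):
    """Extract entity URIs and types from a Spotlight-style JSON response.
    The key names are assumed; adapt them to your annotation service."""
    response = json.loads(annotation_json)
    uris, types = [], set()
    for res in response.get("Resources", []):
        # Skip low-confidence annotations (the threshold is arbitrary here).
        if float(res.get("@similarityScore", 0)) < min_score:
            continue
        uri = res["@URI"]
        if uri not in uris:
            uris.append(uri)
        # Types are assumed to arrive as one comma-separated string.
        for type_name in res.get("@types", "").split(","):
            if type_name:
                types.add(type_name)
    return {"entity_uris": uris, "entity_types": sorted(types)}


def build_index_document(doc_id, text, annotation_json):
    """Combine the article text with entity metadata into one document
    (a plain dict) that can be sent to the index."""
    doc = {"id": doc_id, "body": text}
    doc.update(entity_fields(annotation_json))
    return doc
```

The resulting dict can be serialized as the JSON body of an ElasticSearch index request, or mapped field by field onto a Lucene document. Indexing the URIs as exact, non-analyzed values would let a search for 'wicket' be filtered by the intended entity.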

John Lehmann answered Nov 08 '22 11:11


You can use OpenNLP to extract the names of people, places and organisations without training. You just use the pre-existing models, which can be downloaded from here: http://opennlp.sourceforge.net/models-1.5/

For an example of how to use one of these models, see: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind

Abul Fayes answered Nov 08 '22 11:11