 

Efficiently extract WikiData entities from text

I have a lot of texts (millions), ranging from 100 to 4000 words. The texts are formatted as written work, with punctuation and grammar. Everything is in English.

The problem is simple: How to extract every WikiData entity from a given text?

An entity is defined as any noun, proper or common. That is, names of people, organizations, and locations, as well as ordinary things like chairs, potatoes, etc.

So far I've tried the following:

  1. Tokenize the text with OpenNLP, and use the pre-trained models to extract people, locations, organizations, and regular nouns.
  2. Apply Porter stemming where applicable.
  3. Match all extracted nouns against the wmflabs API to retrieve a potential WikiData ID (steps 1 and 3 are sketched below).
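
For reference, here is a minimal Scala sketch of steps 1 and 3. The model file paths are placeholders for my setup, and I'm using the public wbsearchentities endpoint to stand in for the wmflabs API; the JSON handling is deliberately crude.

```scala
import java.io.FileInputStream
import java.net.URLEncoder

import opennlp.tools.namefind.{NameFinderME, TokenNameFinderModel}
import opennlp.tools.tokenize.{TokenizerME, TokenizerModel}
import opennlp.tools.util.Span

object EntityExtraction {

  // Pre-trained OpenNLP models; the file paths are placeholders for wherever they live.
  val tokenizer = new TokenizerME(new TokenizerModel(new FileInputStream("en-token.bin")))
  val personFinder = new NameFinderME(new TokenNameFinderModel(new FileInputStream("en-ner-person.bin")))

  // Step 1: tokenize, then run a name finder over the tokens.
  def extractPersons(text: String): Seq[String] = {
    val tokens = tokenizer.tokenize(text)
    val spans: Array[Span] = personFinder.find(tokens)
    val names = Span.spansToStrings(spans, tokens).toSeq
    personFinder.clearAdaptiveData() // reset document-level context between texts
    names
  }

  // Step 3: look up a candidate WikiData ID for a surface form via wbsearchentities.
  def wikidataId(name: String): Option[String] = {
    val query = URLEncoder.encode(name, "UTF-8")
    val url = s"https://www.wikidata.org/w/api.php?action=wbsearchentities&search=$query&language=en&format=json"
    val json = scala.io.Source.fromURL(url).mkString
    // Crude extraction of the first "id":"Q..." match; a real pipeline would use a JSON library.
    """"id":"(Q\d+)"""".r.findFirstMatchIn(json).map(_.group(1))
  }
}
```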

This works, but I feel like I can do better. One obvious improvement would be to cache the relevant pieces of WikiData locally, which I plan on doing. However, before I do that, I want to check if there are other solutions.

Suggestions?

I tagged the question Scala because I'm using Spark for the task.

asked Feb 03 '16 by habitats

1 Answer

Some suggestions:

  • consider Stanford NER alongside OpenNLP to see how the two compare on your corpus (a rough sketch follows this list)
  • I question the value of stemming for most entity names
  • I suspect you might be losing information by splitting the task into discrete stages
  • although Wikidata is new, the task isn't, so you might look at papers on entity recognition and disambiguation for Freebase, DBpedia, or Wikipedia
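
On the first point, something along these lines would let you run Stanford NER over the same texts and diff the output against OpenNLP's. The model path assumes the standard English 3-class classifier is on your classpath; treat it as a sketch, not a drop-in.

```scala
import edu.stanford.nlp.ie.crf.CRFClassifier
import edu.stanford.nlp.ling.CoreLabel

import scala.collection.JavaConverters._

object StanfordNerComparison {

  // Standard English 3-class model (PERSON / LOCATION / ORGANIZATION);
  // the path assumes the Stanford NER models jar is on the classpath.
  val classifier: CRFClassifier[CoreLabel] =
    CRFClassifier.getClassifier("edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz")

  // Returns (label, surface form) pairs so the output can be diffed against OpenNLP's.
  def entities(text: String): Seq[(String, String)] =
    classifier.classifyToCharacterOffsets(text).asScala.map { t =>
      (t.first(), text.substring(t.second().intValue(), t.third().intValue()))
    }
}
```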

In particular, DBpedia Spotlight is one system designed for exactly this task.
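
Spotlight is usually driven through its annotate REST endpoint. A minimal client sketch is below; the endpoint URL and confidence threshold are placeholders (for millions of texts you would host your own Spotlight instance), and the returned resources are DBpedia URIs that you would still need to map to Wikidata IDs, e.g. via owl:sameAs links.

```scala
import java.net.{HttpURLConnection, URL, URLEncoder}

import scala.io.Source

object SpotlightClient {

  // Public demo endpoint; for heavy use, run your own Spotlight server and point this at it.
  val endpoint = "https://api.dbpedia-spotlight.org/en/annotate"

  // Returns the raw JSON annotation. Each entry under "Resources" carries a
  // DBpedia URI, which still needs to be mapped to a Wikidata ID.
  def annotate(text: String, confidence: Double = 0.5): String = {
    val url = s"$endpoint?text=${URLEncoder.encode(text, "UTF-8")}&confidence=$confidence"
    val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestProperty("Accept", "application/json")
    try Source.fromInputStream(conn.getInputStream).mkString
    finally conn.disconnect()
  }
}
```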

  • http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38389.pdf
  • http://ceur-ws.org/Vol-1057/Nebhi_LD4IE2013.pdf

answered Oct 01 '22 by Tom Morris