 

Efficiently extract WikiData entities from text

I have a lot of texts (millions), ranging from 100 to 4000 words. The texts are formatted as written work, with punctuation and grammar. Everything is in English.

The problem is simple: How to extract every WikiData entity from a given text?

An entity is defined as any noun, proper or common. That is, names of people, organizations, and locations, as well as ordinary things like chairs, potatoes, etc.

So far I've tried the following:

  1. Tokenize the text with OpenNLP, and use the pre-trained models to extract people, locations, organizations, and regular nouns.
  2. Apply Porter stemming where applicable.
  3. Match all extracted nouns against the wmflabs API to retrieve a potential WikiData ID (steps 1 and 3 are sketched below).
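
For reference, here is a minimal Scala sketch of steps 1 and 3. The model file paths are placeholders for my setup, and I'm using the public wbsearchentities endpoint to stand in for the wmflabs API; the JSON handling is deliberately crude.

```scala
import java.io.FileInputStream
import java.net.URLEncoder

import opennlp.tools.namefind.{NameFinderME, TokenNameFinderModel}
import opennlp.tools.tokenize.{TokenizerME, TokenizerModel}
import opennlp.tools.util.Span

object EntityExtraction {

  // Pre-trained OpenNLP models; the file paths are placeholders for wherever they live.
  val tokenizer = new TokenizerME(new TokenizerModel(new FileInputStream("en-token.bin")))
  val personFinder = new NameFinderME(new TokenNameFinderModel(new FileInputStream("en-ner-person.bin")))

  // Step 1: tokenize, then run a name finder over the tokens.
  def extractPersons(text: String): Seq[String] = {
    val tokens = tokenizer.tokenize(text)
    val spans: Array[Span] = personFinder.find(tokens)
    val names = Span.spansToStrings(spans, tokens).toSeq
    personFinder.clearAdaptiveData() // reset document-level context between texts
    names
  }

  // Step 3: look up a candidate WikiData ID for a surface form via wbsearchentities.
  def wikidataId(name: String): Option[String] = {
    val query = URLEncoder.encode(name, "UTF-8")
    val url = s"https://www.wikidata.org/w/api.php?action=wbsearchentities&search=$query&language=en&format=json"
    val json = scala.io.Source.fromURL(url).mkString
    // Crude extraction of the first "id":"Q..." match; a real pipeline would use a JSON library.
    """"id":"(Q\d+)"""".r.findFirstMatchIn(json).map(_.group(1))
  }
}
```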

This works, but I feel like I can do better. One obvious improvement would be to cache the relevant pieces of WikiData locally, which I plan on doing. However, before I do that, I want to check if there are other solutions.

Suggestions?

I tagged the question Scala because I'm using Spark for the task.

asked Feb 03 '16 by habitats

1 Answer

Some suggestions:

  • consider Stanford NER alongside OpenNLP to see how the two compare on your corpus (a rough sketch follows this list)
  • I question the value of stemming for most entity names
  • I suspect you might be losing information by splitting the task into discrete stages
  • although Wikidata is new, the task isn't, so you might look at papers on entity recognition and disambiguation for Freebase, DBpedia, or Wikipedia
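
On the first point, something along these lines would let you run Stanford NER over the same texts and diff the output against OpenNLP's. The model path assumes the standard English 3-class classifier is on your classpath; treat it as a sketch, not a drop-in.

```scala
import edu.stanford.nlp.ie.crf.CRFClassifier
import edu.stanford.nlp.ling.CoreLabel

import scala.collection.JavaConverters._

object StanfordNerComparison {

  // Standard English 3-class model (PERSON / LOCATION / ORGANIZATION);
  // the path assumes the Stanford NER models jar is on the classpath.
  val classifier: CRFClassifier[CoreLabel] =
    CRFClassifier.getClassifier("edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz")

  // Returns (label, surface form) pairs so the output can be diffed against OpenNLP's.
  def entities(text: String): Seq[(String, String)] =
    classifier.classifyToCharacterOffsets(text).asScala.map { t =>
      (t.first(), text.substring(t.second().intValue(), t.third().intValue()))
    }
}
```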

In particular, DBpedia Spotlight is one system designed for exactly this task.
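
Spotlight is usually driven through its annotate REST endpoint. A minimal client sketch is below; the endpoint URL and confidence threshold are placeholders (for millions of texts you would host your own Spotlight instance), and the returned resources are DBpedia URIs that you would still need to map to Wikidata IDs, e.g. via owl:sameAs links.

```scala
import java.net.{HttpURLConnection, URL, URLEncoder}

import scala.io.Source

object SpotlightClient {

  // Public demo endpoint; for heavy use, run your own Spotlight server and point this at it.
  val endpoint = "https://api.dbpedia-spotlight.org/en/annotate"

  // Returns the raw JSON annotation. Each entry under "Resources" carries a
  // DBpedia URI, which still needs to be mapped to a Wikidata ID.
  def annotate(text: String, confidence: Double = 0.5): String = {
    val url = s"$endpoint?text=${URLEncoder.encode(text, "UTF-8")}&confidence=$confidence"
    val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestProperty("Accept", "application/json")
    try Source.fromInputStream(conn.getInputStream).mkString
    finally conn.disconnect()
  }
}
```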

  • http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38389.pdf
  • http://ceur-ws.org/Vol-1057/Nebhi_LD4IE2013.pdf

answered Oct 01 '22 by Tom Morris