
NLP: Is using a gazetteer cheating?

In NLP there is the concept of a gazetteer, which can be quite useful for creating annotations. As far as I understand:

A gazetteer consists of a set of lists containing names of entities such as cities, organisations, days of the week, etc. These lists are used to find occurrences of these names in text, e.g. for the task of named entity recognition.
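To make the definition concrete, here is a minimal sketch of what such a lookup might look like. The `GAZETTEER` contents and the function names `build_index` and `find_entities` are made up for illustration and do not come from any particular NLP toolkit; the matching strategy (greedy longest match over lowercased word tokens) is just one simple choice.

```python
import re

# Hypothetical toy gazetteer: entity type -> list of names.
GAZETTEER = {
    "CITY": ["New York", "London", "York"],
    "ORG": ["United Nations"],
    "DAY": ["Monday", "Friday"],
}


def build_index(gazetteer):
    """Map each name, as a tuple of lowercased tokens, to its entity type."""
    index = {}
    for label, names in gazetteer.items():
        for name in names:
            index[tuple(name.lower().split())] = label
    return index


def find_entities(text, index):
    """Greedy longest-match lookup of gazetteer names over word tokens."""
    tokens = re.findall(r"\w+", text)
    max_len = max(len(key) for key in index)
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, so "New York" beats "York".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in index:
                matches.append((" ".join(tokens[i:i + n]), index[key]))
                i += n
                break
        else:
            # No gazetteer entry starts at this token; move on.
            i += 1
    return matches


index = build_index(GAZETTEER)
print(find_entities("The United Nations met in New York on Monday.", index))
```

Note that nothing here looks at context: a surname "London" or a company called "Friday" would be tagged the same way, which is why gazetteer hits are often used as one signal among several in a recognizer.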

So it is essentially a lookup. Isn't this kind of a cheat? If we use a gazetteer to detect named entities, then there is not much natural language processing going on. Ideally, I would want to detect named entities using NLP techniques. Otherwise, how is it any better than a regex pattern matcher?

Does that make sense?

AbtPst asked Jan 25 '16 14:01




1 Answer

It depends on how you build and use your gazetteer. If you are presenting experiments in a closed domain and you hand-picked the gazetteer entries, then yes, you are cheating. If you are using an openly available gazetteer and either running experiments on a large dataset or deploying it in an application in the wild where you don't control the input, then you are fine. We found ourselves in a similar situation: we partition our dataset and use the training data to build our gazetteers automatically. As long as you report your methodology, you should not feel like you are cheating (let the reviewers complain).
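As a rough illustration of the partitioning approach, the sketch below builds a gazetteer from the training split alone, so that no name seen only in the test split can leak into the lookup. The document format (a dict with an `"entities"` list of `(name, label)` pairs) and the helper names `split_dataset` and `build_gazetteer` are assumptions made for this example, not a description of the actual pipeline mentioned above.

```python
import random
from collections import defaultdict


def split_dataset(docs, train_fraction=0.8, seed=0):
    """Shuffle a copy of the documents and split it into train and test parts."""
    rng = random.Random(seed)
    shuffled = list(docs)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def build_gazetteer(train_docs):
    """Collect annotated entity names per label from the training documents only."""
    gazetteer = defaultdict(set)
    for doc in train_docs:
        for name, label in doc["entities"]:
            gazetteer[label].add(name.lower())
    return dict(gazetteer)


# Hypothetical annotated documents, purely for illustration.
docs = [
    {"text": "Flights to London", "entities": [("London", "CITY")]},
    {"text": "Paris on Friday", "entities": [("Paris", "CITY"), ("Friday", "DAY")]},
    {"text": "The United Nations", "entities": [("United Nations", "ORG")]},
    {"text": "Monday in Berlin", "entities": [("Monday", "DAY"), ("Berlin", "CITY")]},
    {"text": "Visit Madrid", "entities": [("Madrid", "CITY")]},
]

train_docs, test_docs = split_dataset(docs)
gazetteer = build_gazetteer(train_docs)
print(gazetteer)
```

Evaluation would then run on `test_docs` only, and the write-up would state that the gazetteer was derived from the training split.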

Josep Valls answered Sep 18 '22 16:09
