NLP: Is using a gazetteer cheating?

Tags:

named-entity-recognition

In NLP there is the concept of a gazetteer, which can be quite useful for creating annotations. As far as I understand:

A gazetteer consists of a set of lists containing names of entities such as cities, organisations, days of the week, etc. These lists are used to find occurrences of these names in text, e.g. for the task of named entity recognition.
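To make the definition concrete, here is a minimal sketch of what such a lookup might look like. The `GAZETTEER` contents and the `build_index` and `find_entities` helpers are made up for illustration and do not come from any particular library; the tokenization is a naive whitespace split with a greedy longest-match rule.

```python
# Hypothetical toy gazetteer: entity type -> set of lowercase names.
GAZETTEER = {
    "CITY": {"new york", "paris", "san francisco"},
    "DAY": {"monday", "friday"},
}

PUNCT = ".,;:!?"


def build_index(gazetteer):
    """Map each entry, as a tuple of tokens, to its entity type."""
    index = {}
    for label, names in gazetteer.items():
        for name in names:
            index[tuple(name.split())] = label
    return index


def find_entities(text, gazetteer):
    """Return (start_token, end_token, surface, label) for each match.

    At every position the longest gazetteer entry wins.
    """
    index = build_index(gazetteer)
    max_len = max((len(key) for key in index), default=0)
    # Naive tokenization: split on whitespace, strip edge punctuation.
    tokens = [t.strip(PUNCT) for t in text.split()]
    norm = [t.lower() for t in tokens]
    matches = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(norm[i:i + n])
            if key in index:
                surface = " ".join(tokens[i:i + n])
                matches.append((i, i + n, surface, index[key]))
                i += n  # skip past the matched span
                break
        else:
            i += 1  # no entry starts at this token
    return matches


print(find_entities("I flew from Paris to New York on Friday.", GAZETTEER))
print(find_entities("Paris Hilton arrived", GAZETTEER))
```

Note that the second call tags "Paris" in "Paris Hilton" as a city. A pure lookup has no notion of context, which is one reason gazetteer matches are often fed into a statistical tagger as features instead of being treated as the final answer.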

So it is essentially a lookup. Isn't this kind of a cheat? If we use a gazetteer to detect named entities, there is not much natural language processing going on. Ideally, I would want to detect named entities using NLP techniques. Otherwise, how is it any better than a regex pattern matcher?

Does that make sense?

357

asked Jan 25 '16 14:01

AbtPst

1 Answer

It depends on how you build and use your gazetteer. If you are presenting experiments in a closed domain and you hand-picked the gazetteer entries, then yes, you are cheating. If you are using an openly available gazetteer and running experiments on a large dataset, or using it in an application in the wild where you don't control the input, then you are fine. We found ourselves in a similar situation: we partition our dataset and use the training data to build our gazetteers automatically. As long as you report your methodology, you should not feel like you are cheating (let the reviewers complain).
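As a rough sketch of the "build the gazetteer from the training partition" idea, assuming annotated data in a simple made-up format (a sentence plus a list of `(entity text, label)` pairs). The `DATA` sample and the `split_data`, `build_gazetteer`, and `coverage` helpers are hypothetical names for illustration, not our actual pipeline:

```python
import random

# Hypothetical annotated examples: (sentence, [(entity text, label), ...]).
DATA = [
    ("She moved to Paris in May.", [("Paris", "CITY")]),
    ("Acme opened an office in Berlin.", [("Acme", "ORG"), ("Berlin", "CITY")]),
    ("He left Paris on Friday.", [("Paris", "CITY"), ("Friday", "DAY")]),
    ("Globex is based in Oslo.", [("Globex", "ORG"), ("Oslo", "CITY")]),
]


def split_data(examples, test_fraction=0.25, seed=0):
    """Shuffle with a fixed seed and split into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def build_gazetteer(train_examples):
    """Collect entity names per label from the training partition only."""
    gazetteer = {}
    for _sentence, entities in train_examples:
        for text, label in entities:
            gazetteer.setdefault(label, set()).add(text.lower())
    return gazetteer


def coverage(examples, gazetteer):
    """Fraction of annotated entities whose name is listed under its label."""
    found = total = 0
    for _sentence, entities in examples:
        for text, label in entities:
            total += 1
            if text.lower() in gazetteer.get(label, set()):
                found += 1
    return found / total if total else 0.0


train, test = split_data(DATA)
gazetteer = build_gazetteer(train)  # the test partition is never consulted
print("train coverage:", coverage(train, gazetteer))
print("test coverage:", coverage(test, gazetteer))
```

Coverage on the training partition is 1.0 by construction, so the only number worth reporting is the one on the held-out partition. Building the gazetteer from all of the data would inflate that number, which is the kind of cheating described above.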

200

answered Sep 18 '22 16:09

Josep Valls
