
What features do NLP practitioners use to pick out English names?

Tags:

nlp

nltk

I am trying named entity recognition (NER) for the first time, and I'm looking for features that will pick out English names. I am using the methods outlined in the Coursera NLP course (week three) and the NLTK book. In other words, I define features, extract them for each word, and then run those words and features through a classifier that I train on labeled data.

What features are used to pick out English names?

I can imagine that you'd look for two capitalized words in a row, or a capitalized word followed by an initial and then another capitalized word (e.g., John Smith or James P. Smith).
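
To make those two patterns concrete, here is a rough sketch as a regular expression. The names `NAME_PATTERN` and `candidate_names` are made up for illustration. It only captures the surface pattern described above, so it will also match a capitalized sentence-initial word followed by a name, and it will miss single-word names.

```python
import re

# Capitalized word, an optional single-letter initial ("P."), then another
# capitalized word. This is only the naive pattern from the question.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z]\.)? [A-Z][a-z]+\b")


def candidate_names(text):
    """Return every substring of `text` that matches the naive name pattern."""
    return NAME_PATTERN.findall(text)


print(candidate_names("I spoke with John Smith and James P. Smith today."))
```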

But what other features are used for NER?

bernie2436 asked May 16 '14 19:05


2 Answers

Some common features:

  • Word lists of common names (John, Adam, etc.)
  • Casing (e.g., whether the word starts with a capital letter)
  • Whether the word contains symbols or numeric characters (names generally don't)
  • Person prefixes (Mr., Mrs., etc.)
  • Person suffixes (Jr., Sr., etc.)
  • Single-letter initials (e.g., J. Smith)
  • Analysis of surrounding words (you may find that some words have a high probability of appearing near names)
  • Previously recognized named entities (it is often easy to identify an entity in some parts of the corpus based on context but very hard in others; if a string was identified as an entity earlier, that is an excellent hint when it appears again)
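
As a minimal sketch, the features above could be turned into a per-token feature dictionary along these lines. The word lists, the function name `word_features`, and the `known_entities` parameter are all hypothetical and only for illustration; a real system would use much larger gazetteers and a proper tokenizer.

```python
import re

# Hypothetical, tiny word lists; real gazetteers would be far larger.
COMMON_FIRST_NAMES = {"john", "adam", "james", "mary"}
PERSON_PREFIXES = {"mr.", "mrs.", "ms.", "dr."}
PERSON_SUFFIXES = {"jr.", "sr."}


def word_features(tokens, i, known_entities=frozenset()):
    """Return a feature dict for tokens[i], loosely mirroring the list above."""
    word = tokens[i]
    prev_word = tokens[i - 1].lower() if i > 0 else "<START>"
    next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else "<END>"
    return {
        "in_name_list": word.lower() in COMMON_FIRST_NAMES,
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        # Anything other than letters, periods, apostrophes, or hyphens.
        "has_digit_or_symbol": bool(re.search(r"[^A-Za-z.'\-]", word)),
        "prev_is_prefix": prev_word in PERSON_PREFIXES,
        "next_is_suffix": next_word in PERSON_SUFFIXES,
        "is_initial": bool(re.fullmatch(r"[A-Z]\.", word)),
        # Surrounding words, so a classifier can learn which ones occur near names.
        "prev_word": prev_word,
        "next_word": next_word,
        # Entities already recognized elsewhere in the corpus.
        "seen_as_entity": word in known_entities,
    }


tokens = ["Mr.", "James", "P.", "Smith", "Jr.", "paid", "$40"]
print(word_features(tokens, 1))
```

Dictionaries of this shape are the kind of feature set that NLTK's classifiers are trained on, paired with a label for each token.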

Depending on what language you are working with, there may be more language-specific features as well. Frankly, you can turn up a wealth of information with a simple Google query; I'm really not sure why you haven't turned there. Some starting points, however:

  • Google
  • A survey of named entity recognition and classification
  • Named entity recognition without gazetteers

sooniln answered Nov 17 '22 22:11


I did something similar back in school using machine learning. I assume you will use a supervised algorithm and classify every word independently rather than classifying words in combination. In that case, I would choose some features for the word itself, like the ones you mentioned (whether the word begins with a capital letter, whether the word is an abbreviation), but I would also add features for whether the previous or next word starts with a capital letter or is an abbreviation. This way you add some context and mitigate the problems caused by the independence assumption.
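
A small sketch of that idea: compute the same shape features for the current word and for its neighbours, and prefix them so the classifier can tell them apart. The function names and the abbreviation heuristic (a short token ending in a period) are assumptions made for illustration, not a recommended definition.

```python
def shape_features(word):
    """Features of a single word, ignoring its context."""
    return {
        "capitalized": word[:1].isupper(),
        # Crude, assumed heuristic: a short token ending in a period, e.g. "P." or "Mr."
        "abbreviation": 1 < len(word) <= 4 and word.endswith("."),
    }


def contextual_features(tokens, i):
    """Shape features for tokens[i] plus the same features for its neighbours."""
    features = {"cur_" + k: v for k, v in shape_features(tokens[i]).items()}
    for offset, prefix in ((-1, "prev_"), (1, "next_")):
        j = i + offset
        if 0 <= j < len(tokens):
            neighbour = shape_features(tokens[j])
        else:
            # No neighbour at the start or end of the sentence.
            neighbour = {"capitalized": False, "abbreviation": False}
        features.update({prefix + k: v for k, v in neighbour.items()})
    return features


tokens = ["James", "P.", "Smith", "arrived"]
print(contextual_features(tokens, 1))
```

Each word is still classified on its own, but its feature vector now carries information about the words on either side.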

If you want, have a look here. In the machine learning section you can find more information and examples (the problem is slightly different, but the method should be similar).

Whatever features you choose, it is important to use some measure to evaluate their relevance and, if possible, reduce them to the useful ones to avoid overfitting. One measure you can use is the gain ratio, but there are many more. Here you can find some basic information about feature extraction.
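
For reference, the gain ratio of a feature is its information gain divided by the entropy of the feature's own values (the split information). Below is a minimal sketch for discrete features; the function names and the toy data are made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain of the feature divided by its split information."""
    total = len(labels)
    # Entropy of the labels that remains after splitting on the feature.
    conditional = 0.0
    for value, count in Counter(feature_values).items():
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        conditional += (count / total) * entropy(subset)
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    # A feature with a single value carries no information.
    return info_gain / split_info if split_info > 0 else 0.0


labels = ["NAME", "NAME", "OTHER", "OTHER"]
capitalized = [True, True, False, False]   # separates the classes perfectly
long_word = [True, False, True, False]     # unrelated to the label
print(gain_ratio(capitalized, labels), gain_ratio(long_word, labels))
```

Features whose gain ratio is close to zero on the training data are candidates for removal.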

Hope it helps!

Aspasia answered Nov 17 '22 23:11