
NER naive algorithm

Tags: python, nlp

I have never really dealt with NLP, but I had an idea about NER that should NOT have worked and somehow DOES work exceptionally well in one case. I do not understand why it works, why it fails in other cases, or whether it can be extended.

The idea was to extract names of the main characters in a story through:

  1. Building a dictionary for each word
  2. Filling for each word a list with the words that appear right next to it in the text
  3. Finding for each word a word with the max correlation of lists (meaning that the words are used similarly in the text)
  4. Given the name of one character in the story, the words that are used like it should be character names as well (bogus, I know; that is the part that should not work, but since I had never dealt with NLP until this morning, I started the day naive)

I ran the overly simple code (attached below) on Alice in Wonderland, which for "Alice" returns:

21 ['Mouse', 'Latitude', 'William', 'Rabbit', 'Dodo', 'Gryphon', 'Crab', 'Queen', 'Duchess', 'Footman', 'Panther', 'Caterpillar', 'Hearts', 'King', 'Bill', 'Pigeon', 'Cat', 'Hatter', 'Hare', 'Turtle', 'Dormouse']

Though it filters for upper-case words (and receives "Alice" as the word to cluster around), there are originally ~500 upper-case words in the text, and it is still pretty spot on as far as the main characters go.

It does not work that well with other characters or in other stories, though it gives interesting results.

Any idea whether this approach is usable or extendable, or why it works at all in this story for "Alice"?

Thanks!

#English name recognition
import re
import sys


def mimic_dict(filename):
  """Map each word to the list of words that appear right after it."""
  word_dict = {}
  with open(filename) as f:
    text = f.read()
  prev = ""
  for word in text.split():
    m = re.search(r"\w+", word)
    if m is None:
      continue
    word = m.group()
    word_dict.setdefault(prev, []).append(word)
    prev = word
  return word_dict


def main():
  if len(sys.argv) != 2:
    print('usage: ./main.py file-to-read')
    sys.exit(1)

  word_dict = mimic_dict(sys.argv[1])

  # Collect every capitalized word as a candidate name.
  upper_words = [e for e in word_dict if len(e) > 1 and e[0].isupper()]
  print(len(upper_words), upper_words)

  # Drop common capitalized words that are clearly not names.
  exclude = ["ME", "Yes", "English", "Which", "When", "WOULD", "ONE",
             "THAT", "That", "Here", "and", "And", "it", "It", "me"]
  for s in exclude:
    word_dict.pop(s, None)

  # For each word, find the word whose follower list overlaps its own the most.
  scores = {}
  for key1 in word_dict:
    best = 0
    for key2 in word_dict:
      if key1 == key2:
        continue
      overlap = set(word_dict[key1]) & set(word_dict[key2])
      if len(overlap) > best:
        best = len(overlap)
        scores[key1] = (key2, best)

  # Report the capitalized words whose closest neighbour is "Alice".
  names = [e for e in scores if scores[e][0] == "Alice" and e[0].isupper()]
  print(len(names), names)


if __name__ == '__main__':
  main()
asked Dec 28 '22 by ariy


2 Answers

From the looks of your program and previous experience with NER, I'd say this "works" because you're not doing a proper evaluation. You've found "Hare" where you should have found "March Hare".

The difficulty in NER (at least for English) is not finding the names. It is detecting their full extent (the "March Hare" example), detecting them even at the start of a sentence where all words are capitalized, and classifying them as person/organisation/location/etc.

Also, Alice in Wonderland, being a children's novel, is a rather easy text to process. Newswire phrases like "Microsoft CEO Steve Ballmer" pose a much harder problem; here, you'd want to detect

[ORG Microsoft] CEO [PER Steve Ballmer]
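
A minimal sketch of this kind of span-and-type detection, assuming spaCy and its small English model are installed (pip install spacy, then python -m spacy download en_core_web_sm):

import spacy

# Assumes the en_core_web_sm model has been downloaded.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Microsoft CEO Steve Ballmer spoke at the conference.")
for ent in doc.ents:
  # ent.text is the full extent of the name, ent.label_ its type.
  print(ent.text, ent.label_)
# Prints (approximately): "Microsoft ORG" and "Steve Ballmer PERSON"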
answered Jan 08 '23 by Fred Foo


What you are doing is building a distributional thesaurus: finding words that are distributionally similar to a query (e.g. Alice), i.e. words that appear in similar contexts. This does not automatically make them synonyms, but it does mean they are in some way similar to the query. The fact that your query is a named entity does not on its own guarantee that the similar words you retrieve will be named entities. However, since Alice, the Hare and the Queen tend to appear in similar contexts because they share some characteristics (e.g. they all speak, walk, cry, etc.; the details of Alice in Wonderland escape me), they are more likely to be retrieved.

It turns out that whether a word is capitalised is a very useful piece of information when working out if something is a named entity. If you did not filter out the non-capitalised words, you would see many other neighbours that are not named entities.

Have a look at the following papers to get an idea of what people do with distributional semantics:

  • Lin (1998), "Automatic Retrieval and Clustering of Similar Words", COLING-ACL
  • Grefenstette (1994), "Explorations in Automatic Thesaurus Discovery", Kluwer
  • Schütze (1998), "Automatic Word Sense Discrimination", Computational Linguistics 24(1)

To put your idea in the terminology used in these papers, Step 2 builds a context vector for each word from a window of size 1. Step 3 resembles several well-known similarity measures in distributional semantics, most notably the so-called Jaccard coefficient.
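
Your Step 3 only counts the shared context words; the Jaccard coefficient normalises that overlap by the size of the union. A minimal sketch, with hypothetical helper names:

def context_sets(words):
  # Window-1 context vectors, as in Step 2: each word maps to the set
  # of words that appear immediately after it in the text.
  contexts = {}
  for prev, nxt in zip(words, words[1:]):
    contexts.setdefault(prev, set()).add(nxt)
  return contexts

def jaccard(a, b):
  # |A & B| / |A | B|: 1.0 for identical context sets, 0.0 for disjoint ones.
  return len(a & b) / len(a | b) if a | b else 0.0

words = "the Queen said something and the Hatter said nothing".split()
ctx = context_sets(words)
print(jaccard(ctx["Queen"], ctx["Hatter"]))  # 1.0: both are only followed by "said"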

As larsmans pointed out, this seems to work so well because you are not doing a proper evaluation. If you ran this against a hand-annotated corpus, you would find it is very bad at identifying the boundaries of named entities, and it does not even attempt to guess whether they are people, places or organisations... Nevertheless, it is a great first attempt at NLP. Keep it up!

answered Jan 08 '23 by mbatchkarov