
From a list of strings, identify which are human names and which are not

I have a vector like the one below and would like to determine which elements in the list are human names and which are not. I found the humaniformat package, which formats names but unfortunately does not determine if a string is in fact a name. I also found a few packages for entity extraction, but they seem to require actual text for part-of-speech tagging, rather than a single name.

Example

pkd.names.quotes <- c("Mr. Rick Deckard", # Name
                      "Do Androids Dream of Electric Sheep", # Not a name
                      "Roy Batty", # Name 
                      "How much is an electric ostrich?", # Not a name
                      "My schedule for today lists a six-hour self-accusatory depression.", # Not a name
                      "Upon him the contempt of three planets descended.", # Not a name
                      "J.F. Sebastian", # Name
                      "Harry Bryant", # Name
                      "goat class", # Not a name
                      "Holden, Dave", # Name
                      "Leon Kowalski", # Name
                      "Dr. Eldon Tyrell") # Name
Henry David Thorough asked Sep 13 '15 00:09



1 Answer

Here is one approach. The US Census Bureau tabulates a list of surnames occurring more than 100 times in its database (with frequency): all 152,000 of them. If you use the full list, every one of your strings contains at least one word that counts as a name. For instance, "class", "him" and "the" are names in certain languages (not sure which languages though). Similarly, there are many lists of first names (see this post).
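To make the over-matching concrete, here is a minimal sketch of how the size of the dictionary changes what counts as a name. The `ranked` vector and its ordering are made up for illustration and are not census data; only the idea (common words appear further down a most-common-first surname list) comes from the paragraph above.

```r
# Hypothetical surname list, most common first (illustrative order, not census data)
ranked <- c("smith", "johnson", "bryant", "the", "him", "class")

# Words taken from the example strings in the question
words <- c("bryant", "class", "him", "goat")

strict <- words %in% head(ranked, 3)  # dictionary = top 3 entries only
loose  <- words %in% ranked           # dictionary = the whole list

strict  # TRUE FALSE FALSE FALSE
loose   # TRUE  TRUE  TRUE FALSE
```

With the short dictionary only "bryant" is flagged; with the full list "class" and "him" are flagged too, which is why truncating each list to its most common entries helps.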

The code below grabs all the surnames from the 2000 Census and a list of first names from the post cited, subsets each list to its 10,000 most common entries, combines and cleans the lists, and uses the result as a dictionary in the tm package to identify which strings contain names. You can control the "sensitivity" by altering the freq variable: a larger value matches more strings but also lets in more ordinary words that happen to be names (freq=10,000 seems to generate the result you want).

surname_url <- "http://www2.census.gov/topics/genealogy/2000surnames/names.zip"
tf <- tempfile()
download.file(surname_url, tf, mode="wb")            # download archive of surname data
files    <- unzip(tf, exdir=tempdir())               # unzips and returns a vector of file names
surnames <- read.csv(files[grepl("\\.csv$",files)])  # 152,000 surnames occurring >100 times
first_url  <- "http://deron.meranda.us/data/census-derived-all-first.txt"
firstnames <- read.table(first_url, header=FALSE)    # first names are in column V1
freq <- 10000                                        # keep at most this many names from each list
# head() avoids NA entries when a list has fewer than freq rows
dict  <- unique(c(tolower(head(surnames$name, freq)), tolower(head(firstnames$V1, freq))))
library(tm)
corp <- Corpus(VectorSource(pkd.names.quotes))
tdm  <- TermDocumentMatrix(corp, control=list(tolower=TRUE, dictionary=dict))
m    <- as.matrix(tdm)
m    <- m[rowSums(m)>0, , drop=FALSE]                # keep only terms found in at least one string
m
#            Docs
# Terms       1 2 3 4 5 6 7 8 9 10 11 12
#   bryant    0 0 0 0 0 0 0 1 0  0  0  0
#   dave      0 0 0 0 0 0 0 0 0  1  0  0
#   deckard   1 0 0 0 0 0 0 0 0  0  0  0
#   eldon     0 0 0 0 0 0 0 0 0  0  0  1
#   harry     0 0 0 0 0 0 0 1 0  0  0  0
#   kowalski  0 0 0 0 0 0 0 0 0  0  1  0
#   leon      0 0 0 0 0 0 0 0 0  0  1  0
#   rick      1 0 0 0 0 0 0 0 0  0  0  0
#   roy       0 0 1 0 0 0 0 0 0  0  0  0
#   sebastian 0 0 0 0 0 0 1 0 0  0  0  0
#   tyrell    0 0 0 0 0 0 0 0 0  0  0  1
which(colSums(m)>0)
#  1  3  7  8 10 11 12 
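
If you would rather not depend on tm, the same dictionary lookup can be approximated in base R. This is only a sketch: `contains_name` and `toy_dict` are hypothetical names, the toy dictionary stands in for the census-derived `dict` built above, and the tokenization (split on anything that is not a letter) is an assumption that may not match what tm does.

```r
# Hypothetical stand-in for the census-derived dictionary built above
toy_dict <- c("rick", "deckard", "roy", "harry", "bryant", "dave",
              "leon", "kowalski", "eldon", "tyrell", "sebastian")

# Returns TRUE for each string that has at least one word in the dictionary
contains_name <- function(strings, dict) {
  tokens <- strsplit(tolower(strings), "[^a-z]+")  # split on non-letters
  vapply(tokens, function(tok) any(tok %in% dict), logical(1))
}

x <- c("Mr. Rick Deckard", "goat class", "Holden, Dave", "J.F. Sebastian")
contains_name(x, toy_dict)
# TRUE FALSE TRUE TRUE
```

This returns one logical value per input string, so it can be used directly to subset the original vector, e.g. `x[contains_name(x, toy_dict)]`.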
jlhoward answered Nov 04 '22 21:11
