I have a vector like the one below and would like to determine which of its elements are human names and which are not. I found the humaniformat package, which formats names but unfortunately does not determine whether a string is in fact a name. I also found a few packages for entity extraction, but they seem to require actual running text for part-of-speech tagging, rather than a standalone name.
Example
pkd.names.quotes <- c("Mr. Rick Deckard", # Name
"Do Androids Dream of Electric Sheep", # Not a name
"Roy Batty", # Name
"How much is an electric ostrich?", # Not a name
"My schedule for today lists a six-hour self-accusatory depression.", # Not a name
"Upon him the contempt of three planets descended.", # Not a name
"J.F. Sebastian", # Name
"Harry Bryant", # Name
"goat class", # Not a name
"Holden, Dave", # Name
"Leon Kowalski", # Name
"Dr. Eldon Tyrell") # Name
Here is one approach. The US Census Bureau tabulates a list of surnames occurring > 100 times in its database (with frequency): all 152,000 of them. If you use the full list, every one of your strings contains a "name". For instance, "class", "him" and "the" all appear as surnames in the Census list (presumably originating in various languages). Similarly, there are many lists of first names (one is linked in the code below).
The code below grabs all the surnames from the 2000 Census, and a list of first names from the source cited above, subsets each list to its most common 10,000 entries, combines and cleans the lists, and uses the result as a dictionary in the tm package to identify which strings contain names. You can control the "sensitivity" by altering the freq variable (freq = 10000 seems to generate the result you want).
url <- "http://www2.census.gov/topics/genealogy/2000surnames/names.zip"
tf <- tempfile()
download.file(url,tf, mode="wb") # download archive of surname data
files <- unzip(tf, exdir=tempdir()) # unzips and returns a vector of file names
surnames <- read.csv(files[grepl("\\.csv$",files)]) # 152,000 surnames occurring >100 times
url <- "http://deron.meranda.us/data/census-derived-all-first.txt"
firstnames <- read.table(url(url), header=FALSE)
freq <- 10000
dict <- unique(c(tolower(surnames$name[1:freq]), tolower(firstnames$V1[1:freq])))
library(tm)
corp <- Corpus(VectorSource(pkd.names.quotes))
tdm <- TermDocumentMatrix(corp, control=list(tolower=TRUE, dictionary=dict))
m <- as.matrix(tdm)
m <- m[rowSums(m)>0,]
m
# Docs
# Terms 1 2 3 4 5 6 7 8 9 10 11 12
# bryant 0 0 0 0 0 0 0 1 0 0 0 0
# dave 0 0 0 0 0 0 0 0 0 1 0 0
# deckard 1 0 0 0 0 0 0 0 0 0 0 0
# eldon 0 0 0 0 0 0 0 0 0 0 0 1
# harry 0 0 0 0 0 0 0 1 0 0 0 0
# kowalski 0 0 0 0 0 0 0 0 0 0 1 0
# leon 0 0 0 0 0 0 0 0 0 0 1 0
# rick 1 0 0 0 0 0 0 0 0 0 0 0
# roy 0 0 1 0 0 0 0 0 0 0 0 0
# sebastian 0 0 0 0 0 0 1 0 0 0 0 0
# tyrell 0 0 0 0 0 0 0 0 0 0 0 1
which(colSums(m)>0)
# 1 3 7 8 10 11 12
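The same dictionary-lookup idea also works without the tm dependency, using base R string functions. The sketch below hard-codes a small mini-dictionary purely for illustration; in practice you would pass it the dict object built above.

```r
# Flag strings that contain at least one dictionary word.
# mini.dict is a hard-coded stand-in for the full `dict` built above.
mini.dict <- c("rick", "deckard", "roy", "dave", "holden")

contains.name <- function(x, dictionary) {
  # strip punctuation, lower-case, split on whitespace, test membership
  words <- strsplit(tolower(gsub("[[:punct:]]", " ", x)), "\\s+")
  sapply(words, function(w) any(w %in% dictionary))
}

contains.name(c("Mr. Rick Deckard", "goat class", "Holden, Dave"), mini.dict)
# [1]  TRUE FALSE  TRUE
```

This avoids building a full term-document matrix when all you need is a per-string TRUE/FALSE, though tm's approach gives you the extra detail of which dictionary terms matched which document.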