
Extracting first names in R

Tags: regex, r

Say I have a vector of people's names in my dataframe:

names <- c("Bernice Ingram", "Dianna Dean", "Philip Williamson", "Laurie Abbott",
           "Rochelle Price", "Arturo Fisher", "Enrique Newton", "Sarah Mann",
           "Darryl Graham", "Arthur Hoffman")

I want to create a vector with the first names. All I know about them is that they come first in the vector above and that they're followed by a space. In other words, this is what I'm looking for:

"Bernice" "Dianna"  "Philip" "Laurie" "Rochelle"
"Arturo"  "Enrique" "Sarah"  "Darryl" "Arthur"

I've found a similar question here, but the answers (especially this one) haven't helped much. So far, I've tried a couple of variations of functions from the grep family, and the closest I could get to something useful was running strsplit(names, " ") to split each name into its parts and then strsplit(names, " ")[[1]][1] to get just the first name of the first person. I've been trying to tweak this last command to give me a whole vector of first names, to no avail.

Waldir Leoncio asked Oct 11 '13 15:10


3 Answers

Use sapply to extract the first name:

> sapply(strsplit(names, " "), `[`, 1)
 [1] "Bernice"  "Dianna"   "Philip"   "Laurie"   "Rochelle" "Arturo"   "Enrique" 
 [8] "Sarah"    "Darryl"   "Arthur"

Some comments:

The above works just fine. To make it a bit more general, you could change the split argument of strsplit from " " to "\\s+", which also covers runs of multiple whitespace characters. You could also use gsub to extract everything before the first space directly. This last approach needs only one function call and is likely to be faster (though I haven't benchmarked it).
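
As a rough sketch of the two variations mentioned in these comments, assuming the names vector from the question: the variable names first_split and first_sub are only illustrative, and sub is used in place of gsub because only the first match needs to be replaced.

```r
names <- c("Bernice Ingram", "Dianna Dean", "Philip Williamson", "Laurie Abbott",
           "Rochelle Price", "Arturo Fisher", "Enrique Newton", "Sarah Mann",
           "Darryl Graham", "Arthur Hoffman")

# Split on any run of whitespace, then take the first piece of each element
first_split <- sapply(strsplit(names, "\\s+"), `[`, 1)

# Delete everything from the first whitespace character onward
first_sub <- sub("\\s.*", "", names)

# Both approaches should agree on this input
identical(first_split, first_sub)
```

Whether the single-call version is actually faster would still need to be measured on your own data.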

Michele answered Oct 17 '22 07:10


For what you want, here's a pretty unorthodox way to do it:

read.table(text = names, header = FALSE, stringsAsFactors=FALSE, fill = TRUE)[[1]]
# [1] "Bernice"  "Dianna"   "Philip"   "Laurie"   "Rochelle" "Arturo"   "Enrique"  "Sarah"   
# [9] "Darryl"   "Arthur"  

A5C1D2H2I1M1N2O1R2T1 answered Oct 17 '22 06:10


This seems to work:

unlist(strsplit(names,' '))[seq(1,2*length(names),2)]

This assumes every entry consists of exactly two words, i.e. that no first or last name contains a space.
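
To illustrate why that assumption matters, here is a small sketch with a made-up three-word name ("Mary Ann Smith" is hypothetical and not part of the question's data, and the variable names are illustrative). The odd-position trick gets out of step with the input, whereas taking the first piece of each split element does not.

```r
# "Mary Ann Smith" is a made-up example with a space inside the first name
nm <- c("Mary Ann Smith", "Sarah Mann")

# Odd-position trick: assumes exactly two words per element
odd_trick <- unlist(strsplit(nm, " "))[seq(1, 2 * length(nm), 2)]
odd_trick
# [1] "Mary"  "Smith"

# Taking the first piece per element stays aligned with the input
per_element <- vapply(strsplit(nm, " "), `[`, character(1), 1)
per_element
# [1] "Mary"  "Sarah"
```

Note that neither version recovers "Mary Ann" as a first name; that cannot be done by splitting on spaces alone.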

zzxx53 answered Oct 17 '22 08:10