Efficient way to split a vector of full names into 2 separate vectors

Tags: string, r, strstr

I have a vector of full names, with the first and last name separated by a comma. This is what the first few elements look like:

> head(val.vec)
[1] "Aabye, Edgar"        "Aaltonen, Arvo"      "Aaltonen, Paavo"    
[4] "Aalvik Grimsb, Kari" "Aamodt, Kjetil Andr" "Aamodt, Ragnhild"

I am looking for a way to split them into 2 separate columns of first and last name. My final intention is to have both of them as part of a bigger data frame.

I tried using the strsplit function like this:

names<-unlist(strsplit(val.vec,','))

However, it gave me one long vector instead of 2 separate sets. I know it is possible to loop over all the elements and place the first and last names in 2 separate vectors, but that is a little time-consuming, considering that there are about 25000 records.

I saw a few similar questions, but the discussion there was about how to do it in C++ and Java.

Lee asked Jul 22 '16 15:07


2 Answers

We can use read.csv to convert the vector into a data.frame with 2 columns:

read.csv(text=val.vec, header=FALSE, stringsAsFactors=FALSE)
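
If the space that follows the comma should not end up in the second column, the same call can probably be extended. A minimal sketch, assuming every element contains exactly one comma; the column names `last` and `first` are illustrative and not from the question:

```r
# Small sample in the same "Last, First" shape as the question's data
val.vec <- c("Aabye, Edgar", "Aaltonen, Arvo", "Aalvik Grimsb, Kari")

# strip.white = TRUE drops leading/trailing whitespace of each unquoted field;
# col.names labels the two columns (names chosen here for illustration)
names.df <- read.csv(text = val.vec, header = FALSE,
                     stringsAsFactors = FALSE, strip.white = TRUE,
                     col.names = c("last", "first"))
str(names.df)
```

The space inside "Aalvik Grimsb" is kept, since only whitespace at the edges of a field is stripped.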

Or if we are using strsplit, instead of unlisting (which will convert the whole list to a single vector), we can extract the first and second elements in the list separately to create two vectors ('v1' and 'v2').

lst <- strsplit(val.vec,',')
# sapply simplifies the result to a character vector (lapply would return a list)
v1 <- sapply(lst, `[`, 1)
v2 <- sapply(lst, `[`, 2)
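
With about 25000 records it may be worth making the extraction type-safe. A sketch using vapply, which is declared to return exactly one character value per element and raises an error otherwise; base R's trimws removes the space left after the comma. The names `last`, `first` and `names.df` are illustrative:

```r
val.vec <- c("Aabye, Edgar", "Aaltonen, Arvo", "Aalvik Grimsb, Kari")

# fixed = TRUE treats "," as a literal string rather than a regular expression
parts <- strsplit(val.vec, ",", fixed = TRUE)

# Pick the 1st and 2nd piece of every element; an element without a comma
# would yield NA for the 2nd piece
last  <- trimws(vapply(parts, `[`, character(1), 1))
first <- trimws(vapply(parts, `[`, character(1), 2))

names.df <- data.frame(last = last, first = first, stringsAsFactors = FALSE)
```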

Yet another option would be sub:

v1 <- sub(",.*", "", val.vec)
v2 <- sub("[^,]+,", "", val.vec)
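
Since the stated goal is to have both parts in a bigger data frame, the two sub calls can also assign columns directly. A sketch under assumptions: `athletes` is a hypothetical data frame with one row per name, and the `\\s*` in the second pattern is an addition that also drops the whitespace after the comma:

```r
val.vec <- c("Aabye, Edgar", "Aaltonen, Arvo", "Aalvik Grimsb, Kari")

# Hypothetical existing data frame, one row per element of val.vec
athletes <- data.frame(id = seq_along(val.vec))

# Everything before the first comma
athletes$last <- sub(",.*", "", val.vec)
# Everything after the first comma, minus any whitespace that follows it
athletes$first <- sub("^[^,]*,\\s*", "", val.vec)
```

An element without a comma would be left unchanged by both patterns, so it would appear in both columns.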

data

val.vec <- c("Aabye, Edgar", "Aaltonen, Arvo", "Aaltonen, Paavo", 
        "Aalvik Grimsb, Kari", "Aamodt, Kjetil Andr", "Aamodt, Ragnhild")

akrun answered Sep 27 '22 17:09


Another option:

library(stringi)
stri_split_fixed(val.vec, ",", simplify = TRUE)

Which gives:

#     [,1]            [,2]          
#[1,] "Aabye"         " Edgar"      
#[2,] "Aaltonen"      " Arvo"       
#[3,] "Aaltonen"      " Paavo"      
#[4,] "Aalvik Grimsb" " Kari"       
#[5,] "Aamodt"        " Kjetil Andr"
#[6,] "Aamodt"        " Ragnhild"   

Should you want the result in a data.frame, you could wrap it in as.data.frame().
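
A sketch of that wrapping step, assuming the stringi package is installed; the column names are illustrative, and the trimming of the leading space is an extra step not shown in the output above:

```r
library(stringi)

val.vec <- c("Aabye, Edgar", "Aaltonen, Arvo", "Aalvik Grimsb, Kari")

# simplify = TRUE returns a character matrix with one row per input element
m <- stri_split_fixed(val.vec, ",", simplify = TRUE)

# Convert the matrix to a data frame and label the columns
names.df <- as.data.frame(m, stringsAsFactors = FALSE)
names(names.df) <- c("last", "first")

# Remove the space left in front of the first name
names.df$first <- stri_trim_both(names.df$first)
```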

Steven Beaupré answered Sep 27 '22 16:09