
splitting and reordering character string by comma in r

I have several years' worth of data on individuals, but their names are formatted differently each year. Half of the names are already in "First Last" order, but I can't figure out how to reorder the other half ("Last, First").

Here's a sample df:

name <- c("First1 Last1","Last2, First2", "Last3, First3", "First4 Last4", "First5 Last5")
salary <-c(51000, 72000,125000,67000,155000)
year <-c(2012,2014,2013,2013,2014)

df <- data.frame(name, salary, year, stringsAsFactors=FALSE)

Here's what I've tried, starting by splitting the text on the comma:

df$name2 <- strsplit(df$name, ", ") #to split the character string by comma
df$name3 <-paste(df$name2, collapse=" ") #to collapse the newly created vectors back into a string
df$name4 <-paste(rev(df$name2)) #to try pasting each vector in reverse order
df$name5 <-paste(rev(df$name2)[2:1]) #trying again...

I've printed the correct names, but backwards, and printed them on the wrong rows, but despite all googling I can't get it to work correctly. What am I doing wrong?

asked Oct 25 '16 21:10 by jesstme


2 Answers

You can use a regular expression:

df$name <- sub("(L[A-Za-z0-9]+).*\\s+(F[A-Za-z0-9]+).*","\\2 \\1",df$name)

# df
#           name salary year
# 1 First1 Last1  51000 2012
# 2 First2 Last2  72000 2014
# 3 First3 Last3 125000 2013
# 4 First4 Last4  67000 2013
# 5 First5 Last5 155000 2014

The pattern looks for a word beginning with an uppercase L followed by letters or digits, then any characters, then at least one whitespace character, then a word beginning with an uppercase F followed by letters or digits, and finally anything that comes after it.

It then reorders the two captured words, putting first the one beginning with an F (that is, (F[A-Za-z0-9]+)), then the one beginning with an L (that is, (L[A-Za-z0-9]+)).

As you can see, the code also removes the comma, which seems to be your desired output.
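One caveat worth illustrating: this pattern depends on the placeholder names starting with L and F. The sketch below uses a made-up name ("Smith, John", not from the question's data) to show that a name without those initials is left unchanged.

```r
# "Smith, John" is a hypothetical example; "Last2, First2" is from the sample df
real_names <- c("Smith, John", "Last2, First2")

# Same pattern as above: one word starting with L, another starting with F
pattern <- "(L[A-Za-z0-9]+).*\\s+(F[A-Za-z0-9]+).*"

# sub() returns the input unchanged when the pattern does not match
reordered <- sub(pattern, "\\2 \\1", real_names)
reordered
# [1] "Smith, John"  "First2 Last2"
```

So the approach works for the sample data as posted, but a pattern keyed on the comma is probably safer for real names.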

With the new info from your comment, use this code instead:

df$name <- sub('(.*)\\,\\s+(.*)','\\2 \\1', df$name)

# sub('(.*)\\,\\s+(.*)','\\2 \\1',name)
# [1] "John Smith"       "Marcus Green"     "Mario Sanchez"    "Jennifer Roberts" "Sammy Lee"

Here, we look for any characters before a comma, followed by one or more whitespace characters and then any remaining characters. We then swap the first and second groups to get the desired output.

Note: I assumed that if there is no comma, then the names are already in the correct order (that seems to be the case in your comment).
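To make the behaviour of the comma-based pattern concrete, here is a small sketch on made-up names (these are assumptions for illustration, not the real data). It shows that names without a comma pass through untouched, and that a comma with no following space is not matched by \\s+. Using \\s* instead is one possible variation if such entries exist.

```r
# Hypothetical inputs covering a few cases
names_in <- c("Smith, John", "Jennifer Roberts", "Sanchez, Mario J.", "Lee,Sammy")

# Pattern from the answer (the comma does not need escaping)
strict <- sub("(.*),\\s+(.*)", "\\2 \\1", names_in)
strict
# [1] "John Smith"       "Jennifer Roberts" "Mario J. Sanchez" "Lee,Sammy"

# Variation: \\s* also accepts a comma with no whitespace after it
lenient <- sub("(.*),\\s*(.*)", "\\2 \\1", names_in)
lenient
# [1] "John Smith"       "Jennifer Roberts" "Mario J. Sanchez" "Sammy Lee"
```

Because the first (.*) is greedy, a name containing several commas would be split at the last one, so such rows may need checking by hand.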

answered Nov 15 '22 01:11 by etienne


I think this is what you want. You were really close: you need both a rev and a paste(..., collapse = " "), applied to each name separately. I also trim whitespace, but that may not be necessary.

# look for commas to see which rows need fixing
needs_rearranging = grep(",", df$name)
df$name[needs_rearranging] =
    # split on the comma space, then
    sapply(strsplit(df$name[needs_rearranging], split = ", "),
           function(x) {
               # remove whitespace, reverse the order, and
               # paste them back together
               paste(rev(trimws(x)), collapse = " ")
           })

df
#           name salary year
# 1 First1 Last1  51000 2012
# 2 First2 Last2  72000 2014
# 3 First3 Last3 125000 2013
# 4 First4 Last4  67000 2013
# 5 First5 Last5 155000 2014
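
As a rough sketch of why the attempts in the question misbehaved: strsplit() returns a list with one character vector per name, so rev() and paste(..., collapse = " ") applied to the whole list act across rows rather than within each name. The variable names below (parts, fixed) are just for illustration.

```r
name <- c("First1 Last1", "Last2, First2", "Last3, First3")

# strsplit() gives a list: one character vector per input string
parts <- strsplit(name, ", ")
class(parts)    # "list"
lengths(parts)  # 1 2 2

# rev() on the list reverses the order of the rows, not the pieces of a name
rev(parts)[[1]] # "Last3" "First3"

# paste(..., collapse = " ") on the list squashes every row into one string
length(paste(parts, collapse = " "))  # 1

# Working inside each element fixes both problems
fixed <- sapply(parts, function(x) paste(rev(x), collapse = " "))
fixed
# [1] "First1 Last1" "First2 Last2" "First3 Last3"
```

This is the same idea as the code above, minus the grep() step: a name without a comma is a length-one vector, so reversing it changes nothing.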

answered Nov 15 '22 01:11 by Gregor Thomas