This is piggy backing on a question I answered last night as I am reconsidering how I'd like to format my data. I did search but couldn't find up with any applicable answer; I may be searching with wrong terms.
I have a data table with many rows that I'd like to combine:
record_numb <- c(1,1,1,2,2,2)
col_a <- c(123,'','',987,'','')
col_b <- c('','234','','','765','')
col_c <- c('','','543','','','543')
df <- data.frame(record_numb,col_a,col_b,col_c)
library(data.table)
setDT(df)
record_numb col_a col_b col_c
1 123
1 234
1 345
2 987
2 765
2 543
Each row will always have either col_a, col_b, or col_c populated. It will never have more than 1 of those 3 populated. I'd like to pivot(?) these into a single row per record so it appears like this:
record_numb col_a col_b col_c
1 123 234 345
2 987 765 543
I played with melt/cast a bit, but I'm such a novice at R that half of my issue is knowing what is available to use. There is just so much to use that I'm hoping one of you can point me to a package or function off the top of your head. My searches I performed pointed me to melt and cast and such, but I was unable to apply it to this case. I'm open to using any function or package.
As you suggested that you would like a data.table
solution in your comment, you could use
library(data.table)
df <- data.table(record_numb,col_a,col_b,col_c)
df[, lapply(.SD, paste0, collapse=""), by=record_numb]
record_numb col_a col_b col_c
1: 1 123 234 543
2: 2 987 765 543
.SD
basically says, "take all the variables in my data.table" except those in the by argument. In @Frank's answer, he reduces the set of the variables using .SDcols
. If you want to cast the variables into numeric, you can still do this in one line. Here is a chaining method.
df[, lapply(.SD, paste0, collapse=""), by=record_numb][, lapply(.SD, as.integer)]
The second "chain" casts all the variables as integers.
You can reshape to long format, drop the blank entries and then go back to wide:
res <- dcast(melt(df, id.vars = "record_numb")[ value != "" ], record_numb ~ variable)
record_numb col_a col_b col_c
1: 1 123 234 543
2: 2 987 765 543
You may find it more readable at first using magrittr:
library(magrittr)
res = df %>%
melt(id.vars = "record_numb") %>%
.[ value != "" ] %>%
dcast(record_numb ~ variable)
The numbers are still formatted as strings, but you can convert them with...
cols = setdiff(names(res), "record_numb")
res[, (cols) := lapply(.SD, type.convert), .SDcols = cols]
Type conversion will change each column to whatever class it looks like it should be (numeric, integer, whatever). See ?type.convert
.
Just do this :
df = df %>% group_by(record_numb) %>%
summarise(col_a = sum(col_a, na.rm = T),
col_b = sum(col_b, na.rm = T),
col_c = sum(col_c, na.rm = T))
.... inplace of 'sum' you could use min, max or whatever.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With