Folks,
I'm stumped with the following challenge. I have a data set that looks like this:
BuyerID Fruit.1 Fruit.2 Fruit.3 Amount.1 Amount.2 Amount.3
879 Banana Apple 4 3
765 Strawberry Apple Orange 1 2 4
123 Orange Banana 1 1 1
11 Strawberry 3
773 Kiwi Banana 1 2
What I would like to do is to simplify the data (if possible) and collapse the "Fruit" and "Amount" variables
BuyerID Fruit Amount Total Count
879 "Banana" "Apple" 4 3 7 2
765 "Strawberry" "Apple" "Orange" 1 2 4 7 3
123 "Orange" "Banana" 1 1 1 3 2
11 "Strawberry" 3 3 1
773 "Kiwi" "Banana" 1 2 3 2
I have tried using c() and rbind() but they do not produce the results that I want - I've tried the tip here: data.frame rows to a list as well but am not too sure whether this is the best way to simplify my data.
This is so that presumably it would be easier for me to deal with fewer variables to count the occurrence of certain items for example (e.g. 60% buyers purchase Banana).
I hope this is doable - am also open to any suggestions. Any solutions appreciated!
Thank you.
If you want to concatenate a list of numbers ( int or float ) into a single string, apply the str() function to each element in the list comprehension to convert numbers to strings, then concatenate them with join() .
Since a list can contain any Python variables, it can even contain other lists.
Attempting to replicate your data, and using data.table
DT <- data.frame(
BuyerID = c(879,765,123,11,773),
Fruit.1 = c('Banana','Strawberry','Orange','Strawberry','Kiwi'),
Fruit.2 = c('Apple','Apple','Banana',NA,'Banana'),
Fruit.3 = c( NA, 'Orange',NA,NA,NA),
Amount.1 = c(4,1,1,3,1), Amount.2 = c(3,2,1,NA,2), Amount.3 = c(NA,4,1,NA,NA),
Total = c(7,7,3,3,3),
Count = c(2,3,2,1,2),
stringsAsFactors = FALSE)
# reshaping to long form and data.table
library(data.table)
DTlong <- data.table(reshape(DT, varying = list(Fruit = 2:4, Amount = 5:7),
direction = 'long'))
# create lists (without NA values)
# also adding count and total columns
# by using <- to save Fruit and Amount for later use
DTlist <- DTlong[, list(Fruit <- list(as.vector(na.omit(Fruit.1))),
Amount <- list(as.vector(na.omit(Amount.1))),
Count = length(unlist(Fruit)),
Total = sum(unlist(Amount))),
by = BuyerID]
BuyerID V1 V2 Count Total
1: 879 Banana,Apple 4,3 2 7
2: 765 Strawberry,Apple,Orange 1,2,4 3 7
3: 123 Orange,Banana 1,1,1 2 3
4: 11 Strawberry 3 1 3
5: 773 Kiwi,Banana 1,2 2 3
@RicardoSaporta edit:
You can skip the reshape step, if youd like, using list(list(c(....)))
This will probably save a fair bit of execution time (the drawback is it adds NA
s not blank spaces). However, as @Marius points out, DTlong
above is probably easier to work with.
DT <- data.table(DT)
DT[, Fruit := list(list(c( Fruit.1, Fruit.2, Fruit.3))), by=BuyerID]
DT[, Ammount := list(list(c(Amount.1, Amount.2, Amount.3))), by=BuyerID]
# Or as a single line
DT[, list( Fruit = list(c( Fruit.1, Fruit.2, Fruit.3)),
Ammount = list(c(Amount.1, Amount.2, Amount.3)),
Total, Count), # other columns used
by = BuyerID]
Here a solution , with base package. It is like Tyler solution but with a single apply.
res <- apply(DT,1,function(x){
data.frame(Fruit= paste(na.omit(x[2:4]),collapse=' '),
Amount = paste(na.omit(x[5:7]),collapse =','),
Total = sum(as.numeric(na.omit(x[5:7]))),
Count = length(na.omit(x[2:4])))
})
do.call(rbind,res)
Fruit Amount Total Count
1 Banana Apple 4, 3 7 2
2 Strawberry Apple Orange 1, 2, 4 7 3
3 Orange Banana 1, 1, 1 3 2
4 Strawberry 3 3 1
5 Kiwi Banana 1, 2 3 2
I would also change the index number by a grep, something like this
Fruit = gregexpr('Fruit[.][0-9]', colnames(dat)) > 0
Amount = gregexpr('Amount[.][0-9]', colnames(dat)) > 0
x[2:4] replace by x[which(Fruit)]....
EDIT add some benchmarking.
library(microbenchmark)
library(data.table)
microbenchmark(ag(),mn(), am(), tr())
Unit: milliseconds
expr min lq median uq max
1 ag() 11.584522 12.268140 12.671484 13.317934 109.13419
2 am() 9.776206 10.515576 10.798504 11.437938 137.44867
3 mn() 6.470190 6.805646 6.974797 7.290722 48.68571
4 tr() 1.759771 1.929870 2.026960 2.142066 7.06032
For a small data.frame, Tyler Rinker is the winner!! How I explain this (just a guess)
This is a really bad idea but here it is in base data.frame
. It works because data.frame
is actually a list of equal length vectors. You can force data.frame
to store vectors in cells but it takes some hackery. I'd suggest other formats including Marius's suggestion or a list.
DT <- data.frame(
BuyerID = c(879,765,123,11,773),
Fruit.1 = c('Banana','Strawberry','Orange','Strawberry','Kiwi'),
Fruit.2 = c('Apple','Apple','Banana',NA,'Banana'),
Fruit.3 = c( NA, 'Orange',NA,NA,NA),
Amount.1 = c(4,1,1,3,1), Amount.2 = c(3,2,1,NA,2), Amount.3 = c(NA,4,1,NA,NA),
stringsAsFactors = FALSE)
DT2 <- DT[, 1, drop=FALSE]
DT2$Fruit <- apply(DT[, 2:4], 1, function(x) unlist(na.omit(x)))
DT2$Amount <- apply(DT[, 5:7], 1, function(x) unlist(na.omit(x)))
DT2$Total <- sapply(DT2$Amount, sum)
DT2$Count <- sapply(DT2$Fruit, length)
Yielding:
> DT2
BuyerID Fruit Amount Total Count
1 879 Banana, Apple 4, 3 7 2
2 765 Strawberry, Apple, Orange 1, 2, 4 7 3
3 123 Orange, Banana 1, 1, 1 3 2
4 11 Strawberry 3 3 1
5 773 Kiwi, Banana 1, 2 3 2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With