Custom aggregation on two columns naming fruit

Question

I want to aggregate two columns of a data frame by name, in the following somewhat special way:

drop the parts column from the result by combining the two columns fruits and parts
the parts values for Apple, Banana and Strawberry don't matter, so those rows are simply summed per fruit, while the parts values of Grape and Kiwi should become the new fruits names
the result (at bottom) should have 8 aggregated rows instead of 20

This may sound dead simple at first sight, but after hours of trial and error I haven't found a working solution. Here's the example:

library(lubridate)  # provides today()
theDF <- data.frame(dates = as.Date(c(today()+20)),
    fruits = c("Apple","Apple","Apple","Apple","Banana","Banana","Banana","Banana",
      "Strawberry","Strawberry","Strawberry","Strawberry","Grape", "Grape",
      "Grape","Grape", "Kiwi","Kiwi","Kiwi","Kiwi"),
    parts = c("Big Green Apple","Apple2","Blue Apple","XYZ Apple4",
      "Yellow Banana1","Small Banana","Banana3","Banana4",
      "Red Small Strawberry","Red StrawberryY","Big Strawberry",
       "StrawberryZ","Green Grape", "Blue Grape", "Blue Grape",
       "Blue Grape","Big Kiwi","Small Kiwi","Big Kiwi","Middle Kiwi"),
    stock = as.vector(sample(1:20)) )

The current data frame:

[image: current data frame]

The desired output:

[image: desired output]

akrun · Accepted Answer

We can use data.table. If the part of the 'parts' column to be removed follows a pattern, such as a trailing capital letter or digit, we can strip it with sub, use the result as a grouping variable together with 'dates', and take the sum of 'stock'.

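library(data.table)
# Strip a trailing digit or capital letter from 'parts', then sum 'stock'
# per 'dates' and the cleaned name.
setDT(theDF)[, .(stock = sum(stock)), by = .(dates, fruits = sub("([0-9]|[A-Z])$", "", parts))]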
#        dates      fruits stock
#1: 2016-06-19       Apple    46
#2: 2016-06-19      Banana    35
#3: 2016-06-19  Strawberry    38
#4: 2016-06-19 Green Grape    12
#5: 2016-06-19  Blue Grape    21
#6: 2016-06-19    Big Kiwi    37
#7: 2016-06-19  Small Kiwi    14 
#8: 2016-06-19 Middle Kiwi     7
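
To see why the grouping above collapses some rows and not others, it can help to run the pattern on a few 'parts' values by themselves. This is only a small check, using values copied from the 'data' section of this answer and from the question; the variable names are made up for the illustration.

```r
# Values as they appear in the 'data' section of this answer.
parts <- c("Apple1", "Banana4", "StrawberryX", "Blue Grape", "Big Kiwi")

# A trailing digit or capital letter is removed; anything else is kept.
cleaned <- sub("([0-9]|[A-Z])$", "", parts)
print(cleaned)

# Values from the question's own example end in a lowercase letter,
# so the pattern leaves them unchanged.
unchanged <- sub("([0-9]|[A-Z])$", "", c("Big Green Apple", "Small Banana"))
print(unchanged)
```

The second call suggests that the pattern-based approach fits the data in this answer but not necessarily the 'parts' values typed in the question.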

Or, with dplyr, the same approach looks like this.

library(dplyr)
theDF %>%
    group_by(dates, fruits = sub('([0-9]|[A-Z])$', '', parts)) %>% 
    summarise(stock = sum(stock))

Update

If there is no pattern and the elements of 'fruits' have to be identified manually, create a vector of those elements and use %chin% in 'i' to get a logical index. For the matching rows, assign (:=) the corresponding 'parts' values to 'fruits', then group by 'dates' and 'fruits' and take the sum of 'stock'.

setDT(theDF)[as.character(fruits) %chin% c("Grape", "Kiwi"),
          fruits := parts][, .(stock = sum(stock)), .(dates, fruits)]
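
The same idea can be sketched in base R, without data.table. This is not the code from the answer: it swaps in %in%, logical indexing and aggregate(), and it runs on a small made-up data frame (`df`, with character columns rather than factors) instead of theDF.

```r
# Small stand-in for theDF; the rows and values are hypothetical.
df <- data.frame(
  dates  = as.Date("2016-06-19"),
  fruits = c("Apple", "Apple", "Grape", "Grape", "Grape", "Kiwi"),
  parts  = c("Apple1", "Apple2", "Green Grape", "Blue Grape", "Blue Grape", "Big Kiwi"),
  stock  = c(8, 19, 12, 13, 5, 17),
  stringsAsFactors = FALSE
)

# For the hand-picked fruits, the 'parts' value becomes the new name.
use_parts <- df$fruits %in% c("Grape", "Kiwi")
df$fruits[use_parts] <- df$parts[use_parts]

# Sum 'stock' per date and (renamed) fruit; 'parts' is not in the result.
res <- aggregate(stock ~ dates + fruits, data = df, FUN = sum)
print(res)
```

Here the two Apple rows collapse into one, the two "Blue Grape" rows are summed, and "Green Grape" and "Big Kiwi" keep their own rows.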

data

theDF <- structure(list(dates = structure(c(16971, 16971, 16971, 16971, 
16971, 16971, 16971, 16971, 16971, 16971, 16971, 16971, 16971, 
16971, 16971, 16971, 16971, 16971, 16971, 16971), class = "Date"), 
    fruits = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 5L, 
    5L, 5L, 5L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L), .Label = c("Apple", 
    "Banana", "Grape", "Kiwi", "Strawberry"), class = "factor"), 
    parts = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 14L, 
    15L, 16L, 16L, 11L, 10L, 10L, 10L, 9L, 13L, 9L, 12L), .Label = c("Apple1", 
    "Apple2", "Apple3", "Apple4", "Banana1", "Banana2", "Banana3", 
    "Banana4", "Big Kiwi", "Blue Grape", "Green Grape", "Middle Kiwi", 
    "Small Kiwi", "StrawberryX", "StrawberryY", "StrawberryZ"
    ), class = "factor"), stock = c(8, 19, 15, 4, 6, 18, 1, 10, 
    9, 16, 11, 2, 12, 13, 5, 3, 17, 14, 20, 7)), .Names = c("dates", 
"fruits", "parts", "stock"), row.names = c(NA, -20L), class = "data.frame")

Tags: dataframe, r, aggregate

MHN
