Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R, dplyr: Function that quickly builds list of complementary rows based on conditions

Tags:

r

dplyr

I have a data set of ~80,000 rows by 26 columns. The rows correspond to "SKUs" or unique IDs for robot building sets. The columns correspond to 26 different robot parts. A cell contains a part's contribution towards building a whole robot. The sum of a row's proportion may not sum to 1.0 since a building set won't always have 100% of the parts needed to build a whole robot.

The main goal is to build a function that accepts a SKU as input and outputs a list of complementary SKUs. A complementary row is defined as:

  1. if given row has a non-zero value for a column, then the complement must have a zero value for that column.

The goal is to find all possible sets of SKUs that complement a given SKU such that a whole robot can be built. Additionally, it is important to see the weighted revenue per robot ("weightedPrice") for this "Frankenstein" set of SKUs. It is also nice to show how the weightedPrice changes with the addition of each complementary SKU.

A minimum working, toy example (MWE):

set.seed(1)
a = runif(n=60, min=0, max=0.2)
a[a<0.12] = 0
n = 10
A = as.data.frame(matrix(a,              
                         nrow=n,         
                         ncol=6,         
                         byrow = TRUE))
A$rowTally <- rowSums(A != 0)
A$sku <- seq(from = 1, to = n)
A$totalDollarSales <- runif(n=n, min=1*10^2, max=1*10^6)
A$totalUnitSales <- runif(n=n, min=1*10^2, max=1*10^6)
names(A) <- c("p1_prop", "p2_prop", "p3_prop", "p4_prop", "p5_prop", "p6_prop", "rowTally", "sku", "totalDollarSales", "totalUnitSales")
A <- A[c("sku", "p1_prop", "p2_prop", "p3_prop", "p4_prop", "p5_prop", "p6_prop", "rowTally", "totalDollarSales", "totalUnitSales")]
A$dollarsPerRobot <- A$totalDollarSales/A$totalUnitSales
A

   sku   p1_prop   p2_prop   p3_prop   p4_prop   p5_prop   p6_prop rowTally
1    1 0         0         0         0.1816416 0         0.1796779        2
2    2 0.1889351 0.1321596 0.1258228 0         0         0                3
3    3 0.1374046 0         0.1539683 0         0.1435237 0.1983812        4
4    4 0         0.1554890 0.1869410 0         0.1303348 0                3
5    5 0         0         0         0         0.1739382 0                1
6    6 0         0         0         0         0.1654747 0.1336933        2
7    7 0.1588480 0         0.1447422 0         0.1641893 0.1294120        4
8    8 0.1565866 0         0         0.1578712 0         0                2
9    9 0.1464627 0.1385463 0         0.1722419 0         0                3
10  10 0         0         0         0         0.1324010 0                1
   totalDollarSales totalUnitSales dollarsPerRobot
1         912884.64       339139.0       2.6917711
2         293674.01       839456.4       0.3498383
3         459119.82       346748.8       1.3240703
4         332461.43       333841.6       0.9958659
5         650905.38       476403.6       1.3662898
6         258090.98       892209.1       0.2892718
7         478597.39       864353.0       0.5537059
8         766334.04       390050.5       1.9647044
9          84338.49       777343.0       0.1084959
10        875333.80       960621.9       0.9112157

I'm trying to write a function:

def frankensteinRobot(df, sku, skuRowTally):
    1. find another SKU in dataframe, df.
       - must have non-overlapping parts with existing SKU set
       - rowTally <= skuRowTally (want to find small SKUs to add)
       - must be relatively same number of totalUnitSales
    2. append new SKU to list, and take mininum of totalUnitSales. 
    3. Calculate the weighted, per robot price
       dollarsPerRobotSKU_1*(1/length(SKU_list))+...+dollarsPerRobotSKU_n*(1/length(SKU_list)) 
       and append to the end of a list so we can track profitability with each additional SKU.
    4. repeat steps 1, 2 & 3.

I've only been able to figure out how to find the next complementary SKU, but not the full set of SKUs:

A_candidates <- sapply(data.frame(outer(1:nrow(A), 1:nrow(A), Vectorize(check_compliment))), which)

Example input:

frankensteinRobot(df = A, sku = 5, skuRowTally = 3)

Example output (note that because my MWE only has 10 rows, the example output lists only have 2 elements, but in actually they will be longer. Also, I'm unsure as to what data structure is appropriate. Maybe a dataframe where 1 column is a list?):

[list of SKUs]; [propSum]; [maxLb]; [list of weightedPrice]

[5, 2]; [propSum=0.6208557]; [maxLb=476403.6]; [0.8580641)
[5, 8]; [propSum=0.488396]; [maxLb=390050.5]; [1.665497]
[5, 9]; [propSum=0.6311891]; [maxLb=476403.6]; [0.7373929]

Example input:

frankensteinRobot(df = A, sku = 6, skuRowTally = 2)

Example output:

[6, 8]; [propSum=0.6136258]; [maxLb=390050.5]; [1.126988]
like image 280
user2205916 Avatar asked Apr 08 '19 16:04

user2205916


1 Answers

Final Edit: This solution relies on data.table and could use a loop at the end. Still, you could keep on copying and pasting to make it work. The solution relies mainly on this blazingly quick solution:

  search_dt <- dt[sku1 == searchSKU]
  current_parts <- names(search_dt[, .SD, .SDcols = part_names])[which(search_dt[, .SD, .SDcols = part_names]>0)]

  steal_dt <- dt[rowTally <= searchRowTally]

#returns SKUs which are 0 for the parts we already have
  steal_dt <- steal_dt[steal_dt[, j = rowSums(.SD) == 0, .SDcols = current_parts]]

  franken_rob <- cbind(search_dt, steal_dt)

I'm still uncertain on some of the criteria. I assume that as FrankenBot is assembled, each successive SKU can't be in the previous parts. In other words, SKU3 doesn't have any parts in common with SKU1 and SKU2.

Solution output [I edited it one more time...]{and yet another...}:

# A tibble: 15 x 8
    sku1  sku2  sku3 propSums Parts Robots dollarsPerRobot totalUnitSales
   <int> <int> <int>    <dbl> <dbl>  <dbl>           <dbl>          <dbl>
 1     1     2     5    0.982     6      3           1.47         339139.
 2     1     2    10    0.941     6      3           1.32         339139.
 3     1     4    NA    0.834     5      2           1.84         333842.
 4     1     5    NA    0.535     3      2           2.03         339139.
 5     1    10    NA    0.494     3      2           1.80         339139.
 6     2     5    NA    0.621     4      2           0.858        476404.
 7     2     6    NA    0.746     5      2           0.320        839456.
 8     2    10    NA    0.579     4      2           0.631        839456.
 9     4     8    NA    0.787     5      2           1.48         333842.
10     5     8    NA    0.488     3      2           1.67         390051.
11     5     9    NA    0.631     4      2           0.737        476404.
12     6     8    NA    0.614     4      2           1.13         390051.
13     6     9    NA    0.756     5      2           0.199        777343.
14     8    10    NA    0.447     3      2           1.44         390051.
15     9    10    NA    0.590     4      2           0.510        777343.

solution code:

library(data.table)
# generate data -----------------------------------------------------------

set.seed(1)
n = 10
cols = 6 #added
part_names =  paste0('p', c(1:cols), '_prop')

a = runif(n* cols, min=0, max=0.2)
a[a<0.12] = 0

A <- data.table(matrix(a, nrow=n, ncol=cols,byrow = TRUE))
A[, `:=`(rowTally1 = rowSums(.SD != 0),
         sku1 = .I
         ,totalDollarSales1 = runif(n=n, min=1*10^2, max=1*10^6)
         ,totalUnitSales1 =  runif(n=n, min=1*10^2, max=1*10^6))]

A[, dollarsPerRobot1:=totalDollarSales1/totalUnitSales1]

setnames(A, c(paste0('V',1:cols)), part_names)
setcolorder(A, 'sku1')

non_part_names<- setdiff(names(A), c('sku1',part_names))
non_part_names<- stringr::str_sub(non_part_names, 1, -2)

search_fun <- function (search_dt, steal_dt, searchSKU, b_loop = FALSE, sale_range = NULL) {

  sku_count<- length(grep('sku', names(search_dt)))
  skus <- paste0('sku', 1:(sku_count+1))

  non_parts<- paste0(non_part_names, rep(1:(sku_count+1), each = length(non_part_names)))

  blank_table <- setnames(data.table(matrix(nrow = 0, ncol = length(search_dt) + 1 + length(non_part_names))),c(skus,part_names, non_parts))

  if (length(searchSKU) != sku_count) {
    stop('not enough SKUs to go around')
  } 

  for (i in 1:length(searchSKU)) {
    search_dt <- search_dt[get(paste0('sku', i)) == searchSKU[i]]
  }
  current_parts <- names(search_dt[, .SD, .SDcols = part_names])[which(search_dt[, .SD, .SDcols = part_names]>0)]
  search_dt[, (setdiff(part_names, current_parts)) := NULL, ]

  # Could be made faster if sku1s were filtered out to whichever ones were is sku.N 
  # Right now it still looks through skus that may have already been filtered out.

  if (!is.null(sale_range)) {
    if (length(sale_range) != 2) {
      warning('Sale range needs to be length two with sale_range[1] = lower range and sale_range[2] = upper range')
    } else {
    steal_dt <- steal_dt[between(totalUnitSales1, sale_range[1] * search_dt$totalUnitSales1, search_dt$totalUnitSales1 * sale_range[2])]
    }
  }


  if (b_loop) {
    steal_dt <- steal_dt[sku1 > searchSKU[sku_count]]
  }

  steal_dt <- steal_dt[steal_dt[, j = rowSums(.SD) == 0, .SDcols = current_parts]]
  if (nrow(steal_dt) == 0) {
    return(blank_table)
  }

  steal_dt[, (current_parts) := NULL]
  setnames(steal_dt,
           c('sku1', paste0(non_part_names, '1')) ,
           c(paste0('sku',sku_count+1),
             paste0(non_part_names, sku_count+1))
  )

  franken_rob <- cbind(search_dt, steal_dt)
  setcolorder(franken_rob, c(skus, part_names))
  return(franken_rob)

}

searchRowTally <- 3
dt_search <- A

#this is done outside the function because there can be a lot of looping otherwise
dt_steal <- dt_search[rowTally1 <= searchRowTally]

#Near-instant with 80,000 rows and 26 columns
search_fun(dt_search, dt_steal, dt_search$sku1[5])
search_fun(dt_search, dt_steal, dt_search$sku1[5], b_loop = TRUE)
search_fun(dt_search, dt_steal, dt_search$sku1[5], sale_range = c(0.8, 1.2))
search_fun(dt_search, dt_steal, dt_search$sku1[5], b_loop = TRUE, sale_range = c(0.8, 1.2))

#Not doable with 80,000 rows, but still nice
rbindlist(lapply(1:(n-1), function (i) search_fun(dt_search, dt_steal, dt_search$sku1[i], b_loop = TRUE)))
rbindlist(lapply(1:(n-1), function (i) search_fun(dt_search, dt_steal, dt_search$sku1[i], b_loop = TRUE, sale_range = c(0.8, 1.2))))

#much more likely that the first regression would be a single value
# frank_1 <- search_fun(dt_search, dt_steal, dt_search$sku1[5], FALSE)
frank_1 <-  rbindlist(lapply(1:(n-1), function (i) search_fun(dt_search, dt_steal, dt_search$sku1[i],  TRUE)))

#This takes every n-1 of each sku1 group. 
frank_2 <- frank_1[frank_1[, head(.I, -1), by = sku1]$V1]
# frank_2 <- frank_1[, j = if(.N!=1) .SD, by = sku1]
dt_steal2 <- dt_steal[sku1 %in% base::unique(frank_1$sku2)]

frank_2 = rbindlist(lapply(1:nrow(frank_2), function (i) search_fun(frank_2, dt_steal2, melt(frank_2[i, .SD, .SDcols = grep('sku', names(frank_2))])[[2]],  TRUE)))

frank_3 <- frank_2[frank_2[, head(.I, -1), by = sku2]$V1]
dt_steal3 <- dt_steal2[sku1 %in% base::unique(frank_2$sku3)]

frank_3 = rbindlist(lapply(1:nrow(frank_3), function (i) search_fun(frank_3, dt_steal3, melt(frank_3[i, .SD, .SDcols = grep('sku', names(frank_3))])[[2]],  TRUE)))


# start combindine our lists

franken_rob <- frank_1[!frank_2, on = c('sku1', 'sku2')]
franken_rob[, j= sku3:= integer()]
setcolorder(franken_rob, c('sku1','sku2','sku3'))

franken_rob <- rbind(frank_2, franken_rob, fill = TRUE)
#do above for frank_n times)

franken_rob[, `:=`(propSums=rowSums(.SD),
                   Parts = rowSums(.SD > 0))
            , .SDcols = part_names]

franken_rob[, Robots:= rowSums(.SD > 0, na.rm = TRUE), .SDcols = grep('sku', names(franken_rob))]
franken_rob[, dollarsPerRobot := rowSums(.SD, na.rm = TRUE) / Robots, .SDcols = grep ('dollarsPerRobot', names(franken_rob))]
franken_rob[, totalUnitSales := do.call(pmin,  c(.SD, list(na.rm = TRUE))), .SDcols = grep('totalUnitSales', names(franken_rob))]

franken_rob[, (part_names) := NULL]
franken_rob

tibble::as_tibble(franken_rob[, c(1:3, 16, 17, 18, 19,20)])

Edit: I don't have enough rep to comment - when trying the data.table solution with 80,000 rows and 26 columns, it tries to allocate a 2.3 GB vector when rowTally <= 13. However, when I change that to 3, it makes 1.1 million rows and filters down to 0.3 million rows. This is super cartesian.

Original: Here is a dplyr solution that seems to work with 80,000 rows and 26 columns. The trick was to figure out which columns had a non-zero result for the subset sku. With those columns, I went back to the original df and filtered.

There's also a line commented out for the unitSales being in some range.


set.seed(1)
n = 10
cols = 6 #added

part_names =  paste0('p', c(1:cols), '_prop') #added
a = runif(n * cols, min=0, max=0.2) #changed from n to n * cols
a[a<0.12] = 0

A = as.data.frame(matrix(a,              
                         nrow=n,         
                         ncol=cols,  #changed to cols      
                         byrow = TRUE))
A$rowTally <- rowSums(A != 0)
A$sku <- seq(from = 1, to = n)
A$totalDollarSales <- runif(n=n, min=1*10^2, max=1*10^6)
A$totalUnitSales <- runif(n=n, min=1*10^2, max=1*10^6)
names(A) <- c(part_names, "rowTally", "sku", "totalDollarSales", "totalUnitSales")
A <- A[c("sku", part_names, "rowTally", "totalDollarSales", "totalUnitSales")]
A$dollarsPerRobot <- A$totalDollarSales/A$totalUnitSales


library(dplyr)

df <- as_tibble(A)%>%
  mutate(propSum = rowSums(.[, part_names]))

search_sku <- 5
skuRowTally <- 3

search_df <- df%>%
  filter(sku == search_sku)

current_parts <- search_df%>%
  select(part_names)%>%
  select_if(~sum(.)> 0)%>%
  names()

non_current_parts <- search_df%>%
  select(part_names)%>%
  select_if(~sum(.) == 0)%>%
  names()

df%>%
  filter(rowTally <= skuRowTally,
         sku != search_sku
         # , between(totalUnitSales, 0.7 * search_df$totalUnitSales, 1.3 * search_df$totalUnitSales)
         )%>%
  filter_at(vars(current_parts), all_vars(. == 0))%>%
  filter_at(vars(non_current_parts), any_vars(. != 0))%>%
  rowwise()%>%
  transmute(sku_search = search_sku,
            sku = sku,
            propSum = propSum + search_df$propSum,
            minLB = min(totalUnitSales, search_df$totalUnitSales),
            weightedPrice = (dollarsPerRobot + search_df$dollarsPerRobot) / 2,
            total_parts = rowTally + search_df$rowTally,
            complete_robot = if_else(total_parts == cols, 'COMPLETE', 'incomplete')
  )%>%
  ungroup()


frankensteinRobot <- function (df, sku1, skuTally) {
  # df <- as_tibble(df)%>%
  #   mutate(propSum = rowSums(.[, part_names]))


#part_name and cols would also need to be passed to make this
#completely stand alone.  

  search_sku <- sku1
  skuRowTally <- skuTally

  search_df <- df%>%
    filter(sku == search_sku)

  current_parts <- search_df%>%
    select(part_names)%>%
    select_if(~sum(.)> 0)%>%
    names()

  non_current_parts <- search_df%>%
    select(part_names)%>%
    select_if(~sum(.) == 0)%>%
    names()

  df%>%
    filter(rowTally <= skuRowTally,
           sku > search_sku
           # , between(totalUnitSales, 0.7 * search_df$totalUnitSales, 1.3 * search_df$totalUnitSales)
    )%>%
    filter_at(vars(current_parts), all_vars(. == 0))%>%
    filter_at(vars(non_current_parts), any_vars(. != 0))%>%
    rowwise()%>%
    transmute(sku_search = search_sku,
              sku = sku,
              propSum = propSum + search_df$propSum,
              minLB = min(totalUnitSales, search_df$totalUnitSales),
              weightedPrice = (dollarsPerRobot + search_df$dollarsPerRobot) / 2,
              total_parts = rowTally + search_df$rowTally,
              complete_robot = if_else(total_parts == cols, 'COMPLETE', 'incomplete')
    )%>%
    ungroup()
}

A<- as_tibble(A)%>%
  mutate(propSum = rowSums(.[, part_names]))

#I tried running 1:n with 80,000 rows. It wasn't pretty

bind_rows(lapply(1:n, function(x) frankensteinRobot(A, x, 3)))

edit: here's an attempt at a data.table solution. It's got some similarities but instead of doing it as a loop, it's one go around. If I could figure out how to get your main condition of no matching parts, it probably wouldn't be too shabby. Right now the bottleneck is memory and this as I can't get intersect to work on my list of lists.

results[
apply(results[, .(current_parts, rbt_missing_curr_parts)], 1, function(x) length(intersect(x[[1]], x[[2]]))==0)
]

Main code:

library(data.table)

dt <- as.data.table(A)

dt[
  ,j = `:=`(propSum = rowSums(.SD),
           current_parts = list(which(.SD > 0)),
           missing_parts = list(which(.SD == 0)))
  ,.SDcols = part_names,
  by = sku]

#could subset here as dt[1:100, ...] which would allow bigger datasets
dt_missing_parts <- dt[, .( sku, propSum, current_parts, rowTally, missing_parts, dollarsPerRobot, up_range = 1.3 *totalUnitSales, low_range = 0.7 * totalUnitSales)]

results<- dt_missing_parts[dt[rowTally <= round(cols / 2)],
                 j = .(i.sku, sku,
                       propSum = propSum + i.propSum, 
                       dollarsPerRobot = (dollarsPerRobot + i.dollarsPerRobot) / 2,
                       totalUnitSales = pmin(totalUnitSales, i.totalUnitSales),
                       rbt_missing_curr_parts = i.current_parts, 
                       current_parts,
                       rpt_missing_missing_parts= i.missing_parts,
                       missing_parts,
                       total_parts = rowTally + i.rowTally),
                 on = .(sku > sku
                        #more conditions would be great
                        # ,low_range < totalUnitSales
                        # ,up_range > totalUnitSales
                        ),
                 allow.cartesian = TRUE,
                 nomatch = 0L,
                 by = .I
                 ]
results
results[
apply(results[, .(current_parts, rbt_missing_curr_parts)], 1, function(x) length(intersect(x[[1]], x[[2]]))==0)
]
like image 123
Cole Avatar answered Sep 20 '22 15:09

Cole