
Separate a shopping list into multiple columns

Tags: r, tidyr

I have shopping list data like this:

df <- data.frame(id = 1:5, item = c("apple2milk5", "milk1", "juice3apple5", "egg10juice1", "egg8milk2"), stringsAsFactors = F)

#   id         item
# 1  1  apple2milk5
# 2  2        milk1
# 3  3 juice3apple5
# 4  4  egg10juice1
# 5  5    egg8milk2

I want to split the item variable into multiple columns, one per product, each holding the number that follows the product name. The difficulty is that each person buys different goods, so I cannot solve it with tidyr::separate() or similar functions. What I expect is:

#   id apple milk  juice egg  
# 1  1 2     5     NA    NA   
# 2  2 NA    1     NA    NA   
# 3  3 5     NA    3     NA   
# 4  4 NA    NA    1     10   
# 5  5 NA    2     NA    8    

Note: The categories of goods in the market are unknown. So don't assume there are only 4 kinds of goods.

Thanks for any help!

Darren Tsai asked Nov 12 '19 15:11




3 Answers

I came up with a tidyverse solution that uses stringr::str_extract_all() to extract the quantities, names them with the product names, and widens the result with tidyr::unnest_wider().

library(tidyverse)

df %>%
  mutate(N = map2(str_extract_all(item, "\\d+"), str_extract_all(item, "\\D+"), set_names)) %>%
  unnest_wider(N, transform = as.numeric)

# # A tibble: 5 × 6
#      id item         apple  milk juice   egg
#   <int> <chr>        <dbl> <dbl> <dbl> <dbl>
# 1     1 apple2milk5      2     5    NA    NA
# 2     2 milk1           NA     1    NA    NA
# 3     3 juice3apple5     5    NA     3    NA
# 4     4 egg10juice1     NA    NA     1    10
# 5     5 egg8milk2       NA     2    NA     8
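
To see what the map2() step builds for each row, here is a minimal sketch for a single string. It uses base R's setNames() in place of purrr's set_names(), and the variable names x, qty, and goods are only illustrative.

```r
library(stringr)

x <- "apple2milk5"

# Runs of digits are the quantities; runs of non-digits are the product names.
qty   <- str_extract_all(x, "\\d+")[[1]]
goods <- str_extract_all(x, "\\D+")[[1]]

# One named vector per row; unnest_wider() later turns the names into columns.
named <- setNames(as.numeric(qty), goods)
named
```

Because every row carries its own names, products that a row does not mention simply end up as NA in the widened result.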
Darren Tsai answered Oct 13 '22 07:10


I'll add yet another answer. It differs only slightly from @ASuliman's, but uses newer tidyr functions and a small regex trick to be a bit more straightforward.

The regex trick is that the pattern "(?<=\\d)\\B(?=[a-z])" matches the empty position between a digit and the letter that follows it (a non-boundary, since both are word characters). Splitting there with separate_rows() creates one row for every "apple5"-style entry. extract() then puts the letters into an item column and the digits into a count column. Finally, pivot_wider(), which replaces spread(), can convert those counts to numeric values as it reshapes.

library(dplyr)
library(tidyr)

df %>%
  separate_rows(item, sep = "(?<=\\d)\\B(?=[a-z])") %>%
  extract(item, into = c("item", "count"), regex = "^([a-z]+)(\\d+)$") %>%
  pivot_wider(names_from = item, values_from = count, values_fn = list(count = as.numeric))
#> # A tibble: 5 x 5
#>      id apple  milk juice   egg
#>   <int> <dbl> <dbl> <dbl> <dbl>
#> 1     1     2     5    NA    NA
#> 2     2    NA     1    NA    NA
#> 3     3     5    NA     3    NA
#> 4     4    NA    NA     1    10
#> 5     5    NA     2    NA     8
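
The split can be checked in isolation. The sketch below assumes stringr is installed and applies the same pattern with str_split(); the variable name parts is only illustrative.

```r
library(stringr)

# The pattern matches the zero-width position that has a digit on its left
# and a lowercase letter on its right, so the split consumes no characters.
parts <- str_split("juice3apple5", "(?<=\\d)\\B(?=[a-z])")[[1]]
parts

# Each piece then fits the "^([a-z]+)(\\d+)$" pattern used in extract():
# column 2 holds the letters, column 3 the digits.
pieces <- str_match(parts, "^([a-z]+)(\\d+)$")
pieces
```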
camille answered Oct 13 '22 08:10


Possibly something like this, which should work with any item/quantity pair. It only assumes that the quantity follows the item.

Let's use a custom function that extracts the items and quantities:

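my_fun <- function(w) {
  # Splitting on digits leaves the item names, plus a trailing ""
  # because the string ends with a number
  items <- stringr::str_split(w, "\\d+", simplify = TRUE)
  items <- items[items != ""]
  # Splitting on non-digits leaves the quantities, plus a leading ""
  # because the string starts with an item name
  quantities <- stringr::str_split(w, "\\D+", simplify = TRUE)
  quantities <- quantities[quantities != ""]

  d <- data.frame(item = items, quantity = quantities, stringsAsFactors = FALSE)

  return(d)
}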

Example:

my_fun("apple2milk5")
# gives:
#    item quantity
# 1 apple        2
# 2  milk        5

Now we can apply the function to each id, using nest and map:

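library(dplyr)
library(tidyr) # nest() and unnest() come from tidyr

df_result <- df %>%
  nest(data = item) %>%
  mutate(res = purrr::map(data, ~ my_fun(.x$item))) %>%
  select(-data) %>% # drop the nested input so only id and the result remain
  unnest(res)

df_result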
# # A tibble: 9 x 3
# id item  quantity
# <int> <chr> <chr>   
# 1     1 apple 2       
# 2     1 milk  5       
# 3     2 milk  1       
# 4     3 juice 3       
# 5     3 apple 5       
# 6     4 egg   10      
# 7     4 juice 1       
# 8     5 egg   8       
# 9     5 milk  2 

Now we can use dcast() (probably spread() would work too):

data.table::dcast(df_result, id~item, value.var="quantity")

#     id apple  egg juice milk
#   1  1     2 <NA>  <NA>    5
#   2  2  <NA> <NA>  <NA>    1
#   3  3     5 <NA>     3 <NA>
#   4  4  <NA>   10     1 <NA>
#   5  5  <NA>    8  <NA>    2

Data:

df <- data.frame(id = 1:5, item = c("apple2milk5", "milk1", "juice3apple5", "egg10juice1", "egg8milk2"), stringsAsFactors = F)
RLave answered Oct 13 '22 08:10