Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filter rows in R based on values in multiple rows

Tags:

dataframe

r

dplyr

I'm trying to filter out multiple rows of data that I don't need in R but I'm not sure how to do it.

The data I'm using looks a bit like this:

  Category     Item Shop1 Shop2 Shop3
1    Fruit   Apples     4     6     0
2    Fruit  Oranges     0     2     7
3      Veg Potatoes     0     0     0
4      Veg   Onions     0     0     0
5      Veg  Carrots     0     0     0
6    Dairy  Yoghurt     0     0     0
7    Dairy     Milk     0     1     0
8    Dairy   Cheese     0     0     0

I only want to keep categories where at least one item has a positive value for at least one of the shops.

In this case, I want to get rid of all the Veg rows, because none of the shops have sold any vegetables. I want to keep all the Fruit rows, and I want to keep all the Dairy rows, even those with a value of zero across all shops, because one of the Dairy rows does have a value above 0.

I tried to use colSums after using group_by(Category) in the hope that it would just sum the contents of the Category each time, but it didn't work. I've also tried to add a column at the end for rowSums and to filter based on frequency, but I could only filter out individual rows this way, not rows based on the whole Category.

While I can filter out individual rows that have values of zero (like row 3 for example), my difficulty is keeping in rows like rows 6 and 8 where all the values for each shop are zero, but I want to keep these rows because other Dairy rows do have values above zero.

like image 406
Rose Avatar asked Dec 13 '22 21:12

Rose


1 Answers

1) subset/ave rowSums(...) > 0 has one element for each row. That element is TRUE if there are non-zeros in that row. It assumes that negative values are not possible. (If negative values were possible then use rowSums(DF[-1:-2]^2) > 0 instead.) It also assumes that the shops are those columns past the first two. In particular, it will work for any number of shops. Then ave produces a TRUE for groups for which any of those values is TRUE and subset only keeps those. No packages are used.

subset(DF, ave(rowSums(DF[-1:-2]) > 0, Category, FUN = any))

giving:

  Category    Item Shop1 Shop2 Shop3
1    Fruit  Apples     4     6     0
2    Fruit Oranges     0     2     7
6    Dairy Yoghurt     0     0     0
7    Dairy    Milk     0     1     0
8    Dairy  Cheese     0     0     0

1a) A variation of this would be the following if you don't mind hard coding the shops:

subset(DF, ave(Shop1 + Shop2 + Shop3 > 0, Category, FUN = any))

2) dplyr

library(dplyr)
DF %>% group_by(Category) %>% filter(any(Shop1, Shop2, Shop3)) %>% ungroup

giving:

# A tibble: 5 x 5
# Groups:   Category [2]
  Category    Item Shop1 Shop2 Shop3
    <fctr>  <fctr> <int> <int> <int>
1    Fruit  Apples     4     6     0
2    Fruit Oranges     0     2     7
3    Dairy Yoghurt     0     0     0
4    Dairy    Milk     0     1     0
5    Dairy  Cheese     0     0     0

3) Filter/split Another base solution is:

do.call("rbind", Filter(function(x) any(x[-1:-2]), split(DF, DF$Category)))

giving:

        Category    Item Shop1 Shop2 Shop3
Dairy.6    Dairy Yoghurt     0     0     0
Dairy.7    Dairy    Milk     0     1     0
Dairy.8    Dairy  Cheese     0     0     0
Fruit.1    Fruit  Apples     4     6     0
Fruit.2    Fruit Oranges     0     2     7

4) dplyr/tidyr Use gather to convert the data to long form where there is one row for each value and then filter the groups using any. Finally convert back to wide form.

library(dplyr)
library(tidyr)
DF %>% 
  gather(shop, value, -(Category:Item)) %>% 
  group_by(Category) %>% 
  filter(any(value)) %>% 
  ungroup %>% 
  spread(shop, value)

giving:

# A tibble: 5 x 5
  Category    Item Shop1 Shop2 Shop3
*   <fctr>  <fctr> <int> <int> <int>
1    Dairy  Cheese     0     0     0
2    Dairy    Milk     0     1     0
3    Dairy Yoghurt     0     0     0
4    Fruit  Apples     4     6     0
5    Fruit Oranges     0     2     7

Note: The input in reproducible form is:

Lines <- "  Category     Item Shop1 Shop2 Shop3
1    Fruit   Apples     4     6     0
2    Fruit  Oranges     0     2     7
3      Veg Potatoes     0     0     0
4      Veg   Onions     0     0     0
5      Veg  Carrots     0     0     0
6    Dairy  Yoghurt     0     0     0
7    Dairy     Milk     0     1     0
8    Dairy   Cheese     0     0     0"

DF <- read.table(text = Lines)
like image 58
G. Grothendieck Avatar answered Dec 16 '22 10:12

G. Grothendieck