I'm trying to filter out multiple rows of data that I don't need in R but I'm not sure how to do it.
The data I'm using looks a bit like this:
Category Item Shop1 Shop2 Shop3
1 Fruit Apples 4 6 0
2 Fruit Oranges 0 2 7
3 Veg Potatoes 0 0 0
4 Veg Onions 0 0 0
5 Veg Carrots 0 0 0
6 Dairy Yoghurt 0 0 0
7 Dairy Milk 0 1 0
8 Dairy Cheese 0 0 0
I only want to keep categories where at least one item has a positive value for at least one of the shops.
In this case, I want to get rid of all the Veg rows, because none of the shops have sold any vegetables. I want to keep all the Fruit rows, and I want to keep all the Dairy rows, even those with a value of zero across all shops, because one of the Dairy rows does have a value above 0.
I tried to use colSums after using group_by(Category) in the hope that it would just sum the contents of the Category each time, but it didn't work. I've also tried to add a column at the end for rowSums and to filter based on frequency, but I could only filter out individual rows this way, not rows based on the whole Category.
While I can filter out individual rows that have values of zero (like row 3 for example), my difficulty is keeping in rows like rows 6 and 8 where all the values for each shop are zero, but I want to keep these rows because other Dairy rows do have values above zero.
1) subset/ave rowSums(...) > 0 has one element for each row. That element is TRUE if there are non-zeros in that row. It assumes that negative values are not possible. (If negative values were possible then use rowSums(DF[-1:-2]^2) > 0 instead.) It also assumes that the shops are those columns past the first two. In particular, it will work for any number of shops. Then ave produces a TRUE for groups for which any of those values is TRUE and subset only keeps those. No packages are used.
subset(DF, ave(rowSums(DF[-1:-2]) > 0, Category, FUN = any))
giving:
Category Item Shop1 Shop2 Shop3
1 Fruit Apples 4 6 0
2 Fruit Oranges 0 2 7
6 Dairy Yoghurt 0 0 0
7 Dairy Milk 0 1 0
8 Dairy Cheese 0 0 0
1a) A variation of this would be the following if you don't mind hard coding the shops:
subset(DF, ave(Shop1 + Shop2 + Shop3 > 0, Category, FUN = any))
2) dplyr
library(dplyr)
DF %>% group_by(Category) %>% filter(any(Shop1, Shop2, Shop3)) %>% ungroup
giving:
# A tibble: 5 x 5
# Groups: Category [2]
Category Item Shop1 Shop2 Shop3
<fctr> <fctr> <int> <int> <int>
1 Fruit Apples 4 6 0
2 Fruit Oranges 0 2 7
3 Dairy Yoghurt 0 0 0
4 Dairy Milk 0 1 0
5 Dairy Cheese 0 0 0
3) Filter/split Another base solution is:
do.call("rbind", Filter(function(x) any(x[-1:-2]), split(DF, DF$Category)))
giving:
Category Item Shop1 Shop2 Shop3
Dairy.6 Dairy Yoghurt 0 0 0
Dairy.7 Dairy Milk 0 1 0
Dairy.8 Dairy Cheese 0 0 0
Fruit.1 Fruit Apples 4 6 0
Fruit.2 Fruit Oranges 0 2 7
4) dplyr/tidyr Use gather to convert the data to long form where there is one row for each value and then filter the groups using any. Finally convert back to wide form.
library(dplyr)
library(tidyr)
DF %>%
gather(shop, value, -(Category:Item)) %>%
group_by(Category) %>%
filter(any(value)) %>%
ungroup %>%
spread(shop, value)
giving:
# A tibble: 5 x 5
Category Item Shop1 Shop2 Shop3
* <fctr> <fctr> <int> <int> <int>
1 Dairy Cheese 0 0 0
2 Dairy Milk 0 1 0
3 Dairy Yoghurt 0 0 0
4 Fruit Apples 4 6 0
5 Fruit Oranges 0 2 7
Note: The input in reproducible form is:
Lines <- " Category Item Shop1 Shop2 Shop3
1 Fruit Apples 4 6 0
2 Fruit Oranges 0 2 7
3 Veg Potatoes 0 0 0
4 Veg Onions 0 0 0
5 Veg Carrots 0 0 0
6 Dairy Yoghurt 0 0 0
7 Dairy Milk 0 1 0
8 Dairy Cheese 0 0 0"
DF <- read.table(text = Lines)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With