I'm trying to filter out multiple rows of data that I don't need in R but I'm not sure how to do it.
The data I'm using looks a bit like this:
Category Item Shop1 Shop2 Shop3
1 Fruit Apples 4 6 0
2 Fruit Oranges 0 2 7
3 Veg Potatoes 0 0 0
4 Veg Onions 0 0 0
5 Veg Carrots 0 0 0
6 Dairy Yoghurt 0 0 0
7 Dairy Milk 0 1 0
8 Dairy Cheese 0 0 0
I only want to keep categories where at least one item has a positive value for at least one of the shops.
In this case, I want to get rid of all the Veg
rows, because none of the shops have sold any vegetables. I want to keep all the Fruit
rows, and I want to keep all the Dairy
rows, even those with a value of zero across all shops, because one of the Dairy
rows does have a value above 0.
I tried to use colSums
after using group_by(Category)
in the hope that it would just sum the contents of the Category each time, but it didn't work. I've also tried to add a column at the end for rowSums and to filter based on frequency, but I could only filter out individual rows this way, not rows based on the whole Category.
While I can filter out individual rows that have values of zero (like row 3 for example), my difficulty is keeping in rows like rows 6 and 8 where all the values for each shop are zero, but I want to keep these rows because other Dairy
rows do have values above zero.
1) subset/ave rowSums(...) > 0
has one element for each row. That element is TRUE if there are non-zeros in that row. It assumes that negative values are not possible. (If negative values were possible then use rowSums(DF[-1:-2]^2) > 0
instead.) It also assumes that the shops are those columns past the first two. In particular, it will work for any number of shops. Then ave
produces a TRUE for groups for which any
of those values is TRUE and subset
only keeps those. No packages are used.
subset(DF, ave(rowSums(DF[-1:-2]) > 0, Category, FUN = any))
giving:
Category Item Shop1 Shop2 Shop3
1 Fruit Apples 4 6 0
2 Fruit Oranges 0 2 7
6 Dairy Yoghurt 0 0 0
7 Dairy Milk 0 1 0
8 Dairy Cheese 0 0 0
1a) A variation of this would be the following if you don't mind hard coding the shops:
subset(DF, ave(Shop1 + Shop2 + Shop3 > 0, Category, FUN = any))
2) dplyr
library(dplyr)
DF %>% group_by(Category) %>% filter(any(Shop1, Shop2, Shop3)) %>% ungroup
giving:
# A tibble: 5 x 5
# Groups: Category [2]
Category Item Shop1 Shop2 Shop3
<fctr> <fctr> <int> <int> <int>
1 Fruit Apples 4 6 0
2 Fruit Oranges 0 2 7
3 Dairy Yoghurt 0 0 0
4 Dairy Milk 0 1 0
5 Dairy Cheese 0 0 0
3) Filter/split Another base solution is:
do.call("rbind", Filter(function(x) any(x[-1:-2]), split(DF, DF$Category)))
giving:
Category Item Shop1 Shop2 Shop3
Dairy.6 Dairy Yoghurt 0 0 0
Dairy.7 Dairy Milk 0 1 0
Dairy.8 Dairy Cheese 0 0 0
Fruit.1 Fruit Apples 4 6 0
Fruit.2 Fruit Oranges 0 2 7
4) dplyr/tidyr Use gather
to convert the data to long form where there is one row for each value and then filter the groups using any
. Finally convert back to wide form.
library(dplyr)
library(tidyr)
DF %>%
gather(shop, value, -(Category:Item)) %>%
group_by(Category) %>%
filter(any(value)) %>%
ungroup %>%
spread(shop, value)
giving:
# A tibble: 5 x 5
Category Item Shop1 Shop2 Shop3
* <fctr> <fctr> <int> <int> <int>
1 Dairy Cheese 0 0 0
2 Dairy Milk 0 1 0
3 Dairy Yoghurt 0 0 0
4 Fruit Apples 4 6 0
5 Fruit Oranges 0 2 7
Note: The input in reproducible form is:
Lines <- " Category Item Shop1 Shop2 Shop3
1 Fruit Apples 4 6 0
2 Fruit Oranges 0 2 7
3 Veg Potatoes 0 0 0
4 Veg Onions 0 0 0
5 Veg Carrots 0 0 0
6 Dairy Yoghurt 0 0 0
7 Dairy Milk 0 1 0
8 Dairy Cheese 0 0 0"
DF <- read.table(text = Lines)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With