I'm currently working on a dataframe that looks something like this:
Site  Spp1  Spp2  Spp3  LOC  TYPE
S01   2     4     0     A    FLOOD
S02   4     0     0     A    REG
....
S10   0     1     0     B    FLOOD
S11   1     0     0     B    REG
What I'm trying to do is subset the dataframe so I can run some indicator species analysis in R.
The following code works in that I create two subsets of the data, merge them into one frame and then drop the unused factor levels:
library(dplyr)
A.flood <- filter(data, TYPE == "FLOOD", LOC == "A")
B.flood <- filter(data, TYPE == "FLOOD", LOC == "B")
# droplevels() removes unused levels from every factor column of the piped data frame
A.B.flood <- rbind(A.flood, B.flood) %>% droplevels()
What I also need to do is drop all Spp columns (in my real dataset there are ~60) that sum to zero. Is there a way to achieve this with dplyr, and if so, is it possible to pipe that code onto the existing A.B.flood dataframe code?
Thanks!
EDIT
I managed to remove all the columns that summed to zero, by selecting only the columns that summed to > zero:
A.B.flood.subset <- A.B.flood[, apply(A.B.flood[1:(ncol(A.B.flood))], 2, sum)!=0]
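A caveat on the line above: apply() converts the data frame to a matrix first, so character columns such as Site turn every value into text, and sum() cannot add text. The following is a minimal base R sketch of one way around that, assuming the goal is to keep the non-numeric columns and drop only numeric columns that sum to zero. The toy values and the helper name keep are made up for illustration.

```r
# Toy data mirroring the question's layout (values are illustrative only)
A.B.flood <- data.frame(
  Site = c("S01", "S10"),
  Spp1 = c(2L, 0L),
  Spp2 = c(4L, 1L),
  Spp3 = c(0L, 0L),
  LOC  = c("A", "B"),
  TYPE = c("FLOOD", "FLOOD")
)

# TRUE for non-numeric columns and for numeric columns with a non-zero sum
keep <- vapply(
  A.B.flood,
  function(col) !is.numeric(col) || sum(col, na.rm = TRUE) != 0,
  logical(1)
)

# drop = FALSE keeps the result a data frame even if one column remains
A.B.flood.subset <- A.B.flood[, keep, drop = FALSE]
names(A.B.flood.subset)
```

Checking each column separately with vapply() avoids the matrix conversion, so mixed column types are not a problem.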
dplyr's select() function selects columns, and negating a selection removes those columns instead. All dplyr verbs take a data frame as their first argument.
To drop multiple columns by name, use select(dataframe, -c(column_names)), where dataframe is the input data frame and -c(column_names) is the collection of column names to remove.
To keep only the columns that contain no NA values, select_if() can be used:
library(dplyr)
df %>% select_if(~ !any(is.na(.)))
The easiest way to drop columns in base R is the subset() function. For example, subset(df, select = -c(x, z)) tells R to drop the variables x and z. The '-' sign indicates dropping variables. The variable names must NOT be quoted when using subset().
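The approaches described above can be put side by side in a short sketch. The data frame df and its columns x, y and z are made up for illustration; they are not from the question's data.

```r
library(dplyr)

# Hypothetical toy data: y contains one NA
df <- data.frame(x = 1:3, y = c(2, NA, 4), z = c("a", "b", "c"))

# dplyr: drop columns by name with a negated selection
by_select <- df %>% select(-c(x, z))

# base R: the same drop with subset(); names are unquoted
by_subset <- subset(df, select = -c(x, z))

# dplyr: keep only columns without any NA (here this drops y)
no_na <- df %>% select_if(~ !any(is.na(.)))

names(by_select)
names(by_subset)
names(no_na)
```

Note that select_if() still works but is superseded in recent dplyr versions; select(where(...)) is the newer spelling of the same idea.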
For those who want to use dplyr 1.0.0 with the where() selection helper, you can do:
A.B.flood %>%
select(where( ~ is.numeric(.x) && sum(.x) != 0))
returns:
  Spp1 Spp2
1    2    4
2    4    0
3    0    0
4    4    0
using the same data given by @akrun:
A.B.flood <- structure(
  list(
    Site = c("S01", "S02", "S03", "S04"),
    Spp1 = c(2L, 4L, 0L, 4L),
    Spp2 = c(4L, 0L, 0L, 0L),
    Spp3 = c(0L, 0L, 0L, 0L),
    LOC = c("A", "A", "A", "A"),
    TYPE = c("FLOOD", "REG", "FLOOD", "REG")
  ),
  .Names = c("Site", "Spp1", "Spp2", "Spp3", "LOC", "TYPE"),
  class = "data.frame",
  row.names = c(NA, -4L)
)
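The where() call above keeps only numeric columns, so Site, LOC and TYPE are dropped along with Spp3. If the non-numeric columns should stay, which is what the question seems to want, a possible variant is sketched below. It assumes dplyr 1.0.0 or later; the result name kept is made up for illustration, and the data is rebuilt with data.frame() only to keep the block self-contained.

```r
library(dplyr)

# Same values as the sample data above, rebuilt so the sketch runs on its own
A.B.flood <- data.frame(
  Site = c("S01", "S02", "S03", "S04"),
  Spp1 = c(2L, 4L, 0L, 4L),
  Spp2 = c(4L, 0L, 0L, 0L),
  Spp3 = c(0L, 0L, 0L, 0L),
  LOC  = c("A", "A", "A", "A"),
  TYPE = c("FLOOD", "REG", "FLOOD", "REG")
)

# Keep a column if it is non-numeric, or if it is numeric with a non-zero sum
kept <- A.B.flood %>%
  select(where(~ !is.numeric(.x) || sum(.x) != 0))

names(kept)
```

Because || short-circuits, sum() is only evaluated for numeric columns, so the text columns never reach it.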
I realize this question is now quite old, but I came across it and found another solution using dplyr's select() and which(), which might seem clearer to dplyr enthusiasts:
A.B.flood.subset <- A.B.flood %>% select(which(!colSums(A.B.flood, na.rm=TRUE) %in% 0))
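One thing to watch with this approach: colSums() needs numeric input, so calling it on the whole data frame can stop with an error when text columns such as Site are present. A hedged variant that restricts colSums() to the numeric columns is sketched below. The helper names is_num and zero_cols are made up for illustration, and the data repeats the sample values so the block runs on its own.

```r
library(dplyr)

A.B.flood <- data.frame(
  Site = c("S01", "S02", "S03", "S04"),
  Spp1 = c(2L, 4L, 0L, 4L),
  Spp2 = c(4L, 0L, 0L, 0L),
  Spp3 = c(0L, 0L, 0L, 0L),
  LOC  = c("A", "A", "A", "A"),
  TYPE = c("FLOOD", "REG", "FLOOD", "REG")
)

# Flag the numeric columns, then find which of them sum to zero
is_num <- vapply(A.B.flood, is.numeric, logical(1))
zero_cols <- names(which(colSums(A.B.flood[is_num], na.rm = TRUE) == 0))

# Drop the zero-sum columns by name; every other column is kept
A.B.flood.subset <- A.B.flood %>% select(-all_of(zero_cols))

names(A.B.flood.subset)
```

Selecting by name rather than by position means the indices computed on the numeric subset cannot be mismatched against the full data frame.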