I'm working on a data set that includes community data, and many of the columns (species) have a lot of zeroes. I would like to be able to drop these columns for some of the analyses I'm doing, based on the sum of the whole column.
I'm tempted to do this with a for loop, but I hear that the apply and by functions are better when you're using R.
My goal is to remove all columns with a sum of less than 15.
I have used which() to remove rows by factors, e.g.,
September<-which(data$Time_point=="September")
data<-data[-September,]
and the two ways I've tried removing columns are by using apply():
data<-data[,apply(data,2,function(x)sum(x<=15))]
and by using a messy for loop/if else combo:
for (i in 6:length(data)){
if (sum(data[,i])<=15)
data[,i]<-NULL
else
data[,i]<-data[,i]
}
Neither of these methods has been working. Surely there is an elegant way to get rid of columns based on logical criteria?
str(head(data,10))
'data.frame': 10 obs. of 23 variables:
$ Core_num : Factor w/ 159 levels "152","153","154",..: 133 72 70 75 89 85 86 90 95 99
$ Cage_num : num 0 1 2 3 4 5 6 7 8 9
$ Treatment : Factor w/ 4 levels "","C","CC","NC": 1 2 2 2 2 2 2 2 2 2
$ Site : Factor w/ 10 levels "","B","B07","B08",..: 1 8 8 8 7 7 7 7 9 9
$ Time_point : Factor w/ 3 levels "","May","September": 1 2 2 2 2 2 2 2 2 2
$ Spionidae : num 108 0 0 0 0 0 0 0 0 0
$ Syllidae : num 185 0 0 0 3 8 0 1 4 1
$ Opheliidae : num 424 0 1 0 0 0 1 1 0 0
$ Cossuridae : num 164 0 7 3 0 0 0 0 0 0
$ Sternaspidae: num 214 0 0 6 1 0 11 9 0 0
$ Sabellidae : num 1154 0 2 2 0 ...
$ Capitellidae: num 256 1 10 17 0 3 0 0 0 0
$ Dorvillidae : num 21 1 0 0 0 0 0 0 0 0
$ Cirratulidae: num 17 0 0 0 0 0 0 0 0 0
$ Oligochaeta : num 3747 12 41 27 32 ...
$ Nematoda : num 410 5 4 13 0 0 0 2 2 0
$ Sipuncula : num 33 0 0 0 0 0 0 0 0 0
$ Ostracoda : num 335 0 1 0 0 0 0 0 0 0
$ Decapoda : num 62 0 4 0 1 0 0 0 0 0
$ Amphipoda : num 2789 75 17 34 89 ...
$ Copepoda : num 75 0 0 0 0 0 0 0 0 0
$ Tanaidacea : num 84 0 0 0 1 0 0 0 0 0
$ Mollusca : int 55 0 4 0 0 0 0 0 0 0
What about a simple subset? First, we create a simple data frame:
R> dd = data.frame(x = runif(5), y = 20*runif(5), z=20*runif(5))
Then select the columns where the sum is greater than 15
R> dd1 = dd[,colSums(dd) > 15]
R> ncol(dd1)
[1] 2
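Because runif() gives different sums on every run, here is a minimal sketch of the same idea with fixed numbers, so the result is predictable. The data frame `toy` and its values are made up for illustration. It also passes `drop = FALSE`, which keeps the result a data frame even when only one column passes the test (without it, R would return a plain vector).

```r
# Illustrative data only: column sums are x = 6, y = 60, z = 4.
toy <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30), z = c(0, 0, 4))

# colSums() returns one total per column; the comparison turns
# those totals into a logical vector with one entry per column.
keep <- colSums(toy) > 15

# Only y has a sum above 15. drop = FALSE stops R from
# simplifying the single remaining column to a vector.
toy1 <- toy[, keep, drop = FALSE]
names(toy1)
```

Only `y` should survive here, since it is the only column whose sum exceeds 15.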
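In your data set, you only want to test columns 6 onwards, so something like:
## Drop the first five columns, then keep species columns with a sum above 15
species = data[, 6:ncol(data)]
species[, colSums(species) > 15]
or
## Keep the first five columns, plus species columns with a sum above 15
cols_to_keep = c(rep(TRUE, 5), colSums(data[, 6:ncol(data)]) > 15)
data[, cols_to_keep]
should work.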
The key part to note is that in the square brackets, we want a vector of logicals, i.e. a vector of TRUE and FALSE. So if you wanted to subset using something a bit more complicated, then create a function that returns TRUE or FALSE and subset as usual.
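As a rough sketch of that last point, the function below returns a single TRUE or FALSE per column, and vapply() collects the answers into the logical vector used for subsetting. The names `keep_column`, `min_total` and `comm` are hypothetical, and the data are invented for the example rather than taken from the question. The rule assumed here is "keep every non-numeric column, and keep numeric columns only if their sum is above the threshold".

```r
# Hypothetical helper: returns TRUE for columns that should be kept.
keep_column <- function(col, min_total = 15) {
  # Factors and other non-numeric columns are always kept.
  if (!is.numeric(col)) return(TRUE)
  # Numeric columns are kept only if their total exceeds min_total.
  sum(col, na.rm = TRUE) > min_total
}

# Illustrative data: Spionidae sums to 2, Oligochaeta sums to 80.
comm <- data.frame(
  Site = factor(c("B07", "B08", "B07")),
  Spionidae = c(0, 0, 2),
  Oligochaeta = c(12, 41, 27)
)

# vapply() calls keep_column() on each column and insists on
# exactly one logical value per column.
keep <- vapply(comm, keep_column, logical(1))

comm_reduced <- comm[, keep, drop = FALSE]
names(comm_reduced)
```

This approach avoids hard-coding which column positions hold the species counts, at the cost of assuming that every numeric column is a count you are willing to drop. In the question's data, Cage_num is numeric too, so it would need to be excluded from the test or converted to a factor first.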