Excluding columns from a dataframe based on column sums

I'm working on a data set that includes community data, and many of the columns (species) have a lot of zeroes. I would like to be able to drop these columns for some of the analyses I'm doing, based on the sum of the whole column. I'm tempted to do this with a for loop, but I hear that the apply and by functions are better when you're using R. My goal is to remove all columns with a sum of less than 15. I have used which() to remove rows by factors, e.g.,

September<-which(data$Time_point=="September")

data<-data[-September,] 

and the two ways I've tried to remove columns are by using apply():

data<-data[,apply(data,2,function(x)sum(x<=15))]

and by using a messy for loop/if else combo:

for (i in 6:length(data)){
    if (sum(data[,i])<=15)
    data[,i]<-NULL
    else 
    data[,i]<-data[,i]
    }

Neither of these methods has been working. Surely there is an elegant way to get rid of columns based on logical criteria?

str(head(data,10))
'data.frame':   10 obs. of  23 variables:
 $ Core_num    : Factor w/ 159 levels "152","153","154",..: 133 72 70 75 89 85 86 90 95 99
 $ Cage_num    : num  0 1 2 3 4 5 6 7 8 9
 $ Treatment   : Factor w/ 4 levels "","C","CC","NC": 1 2 2 2 2 2 2 2 2 2
 $ Site        : Factor w/ 10 levels "","B","B07","B08",..: 1 8 8 8 7 7 7 7 9 9
 $ Time_point  : Factor w/ 3 levels "","May","September": 1 2 2 2 2 2 2 2 2 2
 $ Spionidae   : num  108 0 0 0 0 0 0 0 0 0
 $ Syllidae    : num  185 0 0 0 3 8 0 1 4 1
 $ Opheliidae  : num  424 0 1 0 0 0 1 1 0 0
 $ Cossuridae  : num  164 0 7 3 0 0 0 0 0 0
 $ Sternaspidae: num  214 0 0 6 1 0 11 9 0 0
 $ Sabellidae  : num  1154 0 2 2 0 ...
 $ Capitellidae: num  256 1 10 17 0 3 0 0 0 0
 $ Dorvillidae : num  21 1 0 0 0 0 0 0 0 0
 $ Cirratulidae: num  17 0 0 0 0 0 0 0 0 0
 $ Oligochaeta : num  3747 12 41 27 32 ...
 $ Nematoda    : num  410 5 4 13 0 0 0 2 2 0
 $ Sipuncula   : num  33 0 0 0 0 0 0 0 0 0
 $ Ostracoda   : num  335 0 1 0 0 0 0 0 0 0
 $ Decapoda    : num  62 0 4 0 1 0 0 0 0 0
 $ Amphipoda   : num  2789 75 17 34 89 ...
 $ Copepoda    : num  75 0 0 0 0 0 0 0 0 0
 $ Tanaidacea  : num  84 0 0 0 1 0 0 0 0 0
 $ Mollusca    : int  55 0 4 0 0 0 0 0 0 0
Margaret asked May 15 '12 20:05



1 Answer

What about a simple subset? First, we create a simple data frame:

R> dd = data.frame(x = runif(5), y = 20*runif(5), z=20*runif(5))

Then select the columns where the sum is greater than 15:

R> dd1 = dd[,colSums(dd) > 15]
R> ncol(dd1)
[1] 2

In your data set, you only want to subset columns 6 onwards, so something like:

 ##Drop the first five columns, then keep those whose sum exceeds 15
 sp = dd[,6:ncol(dd)]
 sp[,colSums(sp) > 15]

or

 ##Keep the first five columns, plus those whose sum exceeds 15
 cols_to_keep = c(rep(TRUE, 5), colSums(dd[,6:ncol(dd)]) > 15)
 dd[,cols_to_keep]

should work.
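
To make the second approach concrete, here is a minimal sketch on a toy data frame shaped like the one in the question. The data frame `toy` and all of its values are made up for illustration; only the layout (five metadata columns followed by count columns) follows the question.

```r
# Toy data: five metadata columns followed by three count columns.
# All names and values here are hypothetical.
toy = data.frame(Core_num   = factor(c("152", "153", "154")),
                 Cage_num   = c(0, 1, 2),
                 Treatment  = factor(c("C", "CC", "NC")),
                 Site       = factor(c("B", "B07", "B08")),
                 Time_point = factor(c("May", "May", "September")),
                 Spionidae  = c(0, 0, 3),    # sum is 3
                 Syllidae   = c(10, 8, 4),   # sum is 22
                 Nematoda   = c(5, 4, 13))   # sum is 22

# Always keep the five metadata columns; keep a count column
# only if its sum exceeds 15.
keep = c(rep(TRUE, 5), colSums(toy[, 6:ncol(toy)]) > 15)

# drop = FALSE keeps the result a data frame even if one column survives.
toy_kept = toy[, keep, drop = FALSE]
names(toy_kept)
```

The logical vector `keep` has one entry per column of `toy`, which is what makes the subset line up correctly. Here `Spionidae` is dropped and the other seven columns are retained.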


The key part to note is that in the square brackets, we want a vector of logicals, i.e. a vector of TRUE and FALSE. So if you wanted to subset using something a bit more complicated, then create a function that returns TRUE or FALSE and subset as usual.
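
As a sketch of that last idea, the helper below returns a single TRUE or FALSE per column. The name `keep_column` and its `min_sum` argument are hypothetical, and the rule it encodes (always keep non-numeric columns, and keep numeric ones only when their sum reaches the threshold) is just one possible criterion.

```r
# Hypothetical helper: TRUE for columns to keep, FALSE for columns to drop.
# Non-numeric columns (e.g. factors) are always kept; numeric columns are
# kept only when their sum, ignoring NAs, is at least min_sum.
keep_column = function(x, min_sum = 15) {
  if (!is.numeric(x)) return(TRUE)
  sum(x, na.rm = TRUE) >= min_sum
}

# Small made-up example
dd = data.frame(site   = factor(c("a", "b", "c")),
                rare   = c(0, 1, 2),      # sum is 3
                common = c(10, 20, 30))   # sum is 60

# vapply calls keep_column on each column and returns a logical vector
keep = vapply(dd, keep_column, logical(1))
dd2 = dd[, keep, drop = FALSE]
names(dd2)
```

Note that a numeric metadata column with a small sum (such as `Cage_num` in the question) would be dropped by this rule, so for that data set you would either apply the function to columns 6 onwards only or protect the metadata columns explicitly.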

csgillespie answered Sep 18 '22 17:09
