Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to delete columns from a data.table based on values in column

Tags:

r

data.table

Background

I have some financial data (1.5 years SP500 stocks) that I have manipulated into a wide format using the data.table package. After following the whole data.table course on Datacamp, I'm starting to get a hang of the basics, but after searching for hours I'm at a loss on how to do this.

The Problem

The data contains columns with financial data for each stock. I need to delete columns that contain two consecutive NAs.

My guess is I have to use rle(), lapply(), to find consecutive values and DT[,x:= NULL]) to delete the columns.

I read that rle() doesn't work on NAs, so I changed them to Inf instead. I just don't know how to combine the functions so that I can efficiently remove a few columns among the 460 that I have.

An answer using data.table would be great, but anything that works well is very much appreciated.

Alternatively I would love to know how to remove columns containing at least 1 NA

Example data

> test[1:5,1:5,with=FALSE]
         date     10104     10107     10138     10145
1: 2012-07-02  0.003199       Inf  0.001112 -0.012178
2: 2012-07-03  0.005873  0.006545  0.001428       Inf
3: 2012-07-05       Inf -0.001951 -0.011090       Inf
4: 2012-07-06       Inf -0.016775 -0.009612       Inf
5: 2012-07-09 -0.002742 -0.006129 -0.001294  0.005830
> dim(test)
[1] 377 461

Desired outcome

         date     10107     10138
1: 2012-07-02       Inf  0.001112
2: 2012-07-03  0.006545  0.001428
3: 2012-07-05 -0.001951 -0.011090
4: 2012-07-06 -0.016775 -0.009612
5: 2012-07-09 -0.006129 -0.001294

PS. This is my first question, I have tried to adhere to the rules, if I need to change anything please let me know.

like image 622
Cristopher van der Kooij Avatar asked Mar 15 '23 17:03

Cristopher van der Kooij


2 Answers

Here's an rle version:

dt[, sapply(dt, function(x)
       setDT(rle(is.na(x)))[, sum(lengths > 1 & values) == 0]), with = F]

Or replace the is.na with is.infinite if you like.

like image 103
eddi Avatar answered Mar 18 '23 15:03

eddi


To detect and delete columns containing atleast one NA, you can try the following

data = data.frame(A=c(1,2,3,4,5), B=c(2,3,4,NA,6), C=c(3,4,5,6,7), D=c(4,5,NA,NA,8))

colsToDelete = lapply(data, FUN = function(x){ sum(is.na(x)) >= 1 })

data.formatted = data[,c(!unlist(colsToDelete))]
like image 21
SKG Avatar answered Mar 18 '23 15:03

SKG