Delete columns/rows with more than x% missing

Tags: r, dplyr

I want to delete all columns or rows with more than 50% NAs in a data frame.

This is my solution:

    # delete columns with more than 50% missing
    miss <- c()
    for (i in 1:ncol(data)) {
      if (length(which(is.na(data[, i]))) > 0.5 * nrow(data)) miss <- append(miss, i)
    }
    data2 <- data[, -miss]

    # delete rows with more than 50% missing
    miss2 <- c()
    for (i in 1:nrow(data)) {
      if (length(which(is.na(data[i, ]))) > 0.5 * ncol(data)) miss2 <- append(miss2, i)
    }
    data <- data[-miss2, ]

but I'm looking for a nicer/faster solution.

I would also appreciate a dplyr solution.

spore234 asked Aug 06 '15 06:08


1 Answer

To remove columns with some amount of NA, you can use colMeans(is.na(...))

    ## Some sample data
    set.seed(0)
    dat <- matrix(1:100, 10, 10)
    dat[sample(1:100, 50)] <- NA
    dat <- data.frame(dat)

    ## Remove columns with more than 50% NA
    dat[, which(colMeans(is.na(dat)) <= 0.5)]

    ## Remove rows with more than 50% NA
    dat[which(rowMeans(is.na(dat)) <= 0.5), ]

    ## Remove columns and rows with more than 50% NA
    ## (both shares are computed on the original data)
    dat[which(rowMeans(is.na(dat)) <= 0.5), which(colMeans(is.na(dat)) <= 0.5)]
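Since the title asks about "more than x%" rather than a fixed 50%, the one-liners above could be wrapped in a small function with the threshold as a parameter. This is only a sketch: `drop_sparse`, `max_na`, and the `toy` data frame are names made up for illustration and are not part of the original answer. It keeps a row or column when its share of NA is at most `max_na`, and computes both shares on the input as given.

```r
# Hypothetical helper (not from the original answer): drop rows and/or
# columns whose share of NA values is greater than `max_na`.
drop_sparse <- function(df, max_na = 0.5, rows = TRUE, cols = TRUE) {
  # Logical keep-vectors; both are computed on the unmodified input
  keep_rows <- if (rows) rowMeans(is.na(df)) <= max_na else rep(TRUE, nrow(df))
  keep_cols <- if (cols) colMeans(is.na(df)) <= max_na else rep(TRUE, ncol(df))
  # drop = FALSE keeps the result a data frame even if one column survives
  df[keep_rows, keep_cols, drop = FALSE]
}

# Small illustrative data set (assumed, not from the question)
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c(NA, NA, NA, 4),
  c = c(1, 2, 3, NA)
)

drop_sparse(toy, rows = FALSE)  # drops column b (3 of 4 values are NA)
drop_sparse(toy, cols = FALSE)  # drops rows 2 and 4 (2 of 3 values are NA)
```

Note that removing columns first and then rows (or the other way round) can give a different result than the simultaneous filter used here, because each removal changes the shares seen by the next step.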
Rorschach answered Sep 21 '22 07:09
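The question also asks for a dplyr solution, which the answer does not cover. One possible sketch is shown here, assuming dplyr 1.0.0 or later (for `where()`) and the magrittr pipe `%>%`; the `toy` data frame is made up for illustration. The row filter relies on magrittr's `.` placeholder, so it would not work unchanged with the native `|>` pipe.

```r
library(dplyr)  # assumes dplyr >= 1.0.0, which re-exports where()

# Small illustrative data set (assumed, not from the question)
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c(NA, NA, NA, 4),
  c = c(1, 2, 3, NA)
)

# Keep columns in which at most 50% of the values are NA
cols_kept <- toy %>%
  select(where(~ mean(is.na(.x)) <= 0.5))

# Keep rows in which at most 50% of the values are NA;
# `.` is magrittr's placeholder for the piped-in data frame
rows_kept <- toy %>%
  filter(rowMeans(is.na(.)) <= 0.5)

cols_kept  # columns a and c remain
rows_kept  # original rows 1 and 3 remain
```

For the row step dplyr adds little over base R, since `rowMeans(is.na(...))` does the actual work in both versions; the column step is where `select(where(...))` reads more naturally.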