Delete columns/rows with more than x% missing

Tags:

r

dplyr

I want to delete all columns or rows with more than 50% NAs in a data frame.

This is my solution:

# delete columns with more than 50% missing
miss <- c()
for(i in 1:ncol(data)) {
  if(length(which(is.na(data[,i]))) > 0.5*nrow(data)) miss <- append(miss,i)
}
data2 <- data[,-miss]

# delete rows with more than 50% missing
miss2 <- c()
for(i in 1:nrow(data)) {
  if(length(which(is.na(data[i,]))) > 0.5*ncol(data)) miss2 <- append(miss2,i)
}
data <- data[-miss2,]

But I'm looking for a nicer/faster solution.

I would also appreciate a dplyr solution.

887

asked Aug 06 '15 06:08

spore234

1 Answer

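To remove columns with more than a given share of NA values, you can use colMeans(is.na(...)), which gives the fraction of NAs in each column. rowMeans(is.na(...)) does the same for rows.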

## Some sample data
set.seed(0)
dat <- matrix(1:100, 10, 10)
dat[sample(1:100, 50)] <- NA
dat <- data.frame(dat)

## Remove columns with more than 50% NA
dat[, which(colMeans(is.na(dat)) <= 0.5)]

## Remove rows with more than 50% NA
dat[which(rowMeans(is.na(dat)) <= 0.5), ]

## Remove columns and rows with more than 50% NA
dat[which(rowMeans(is.na(dat)) <= 0.5), which(colMeans(is.na(dat)) <= 0.5)]
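The question also asks for a dplyr solution. The following is a minimal sketch of one way to do it, not part of the original answer. The helper name drop_sparse, its max_na parameter and the toy data frame are hypothetical, and where() assumes dplyr 1.0.0 or later. Unlike the base R one-liner, which judges rows against all original columns, this sketch drops columns first and then judges rows against the columns that remain.

```r
library(dplyr)

# Hypothetical helper (not from the original answer): keep only the columns
# and rows whose share of NA values is at most `max_na`.
drop_sparse <- function(df, max_na = 0.5) {
  # Columns first: where() applies the predicate to each column in turn.
  df <- select(df, where(~ mean(is.na(.x)) <= max_na))
  # Then rows, judged against the remaining columns only.
  # Assumes the data frame has no column that is itself named `df`.
  filter(df, rowMeans(is.na(df)) <= max_na)
}

# Small illustrative data set (made up for this example).
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c(NA, NA, NA, 4),
  c = c(1, NA, 3, 4)
)

drop_sparse(toy)                 # default threshold of 50%
drop_sparse(toy, max_na = 0.8)   # a more lenient threshold
```

Passing the threshold as an argument covers the general "more than x% missing" case from the title. If rows should be judged against the original columns instead, compute the row filter before the select() step.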

119

answered Sep 21 '22 07:09

Rorschach
