
Collapsing rows where some are all NA, others are disjoint with some NAs

Tags: r, aggregate, na

I have a simple data frame like this:

ID    Col1    Col2    Col3    Col4
1     NA      NA      NA      NA  
1     5       10      NA      NA
1     NA      NA      15      20
2     NA      NA      NA      NA  
2     25      30      NA      NA
2     NA      NA      35      40 

And I would like to reshape it into this:

ID    Col1    Col2    Col3    Col4
1     5       10      15      20
2     25      30      35      40

(please note: the real data set has thousands of rows and the values are from biological data -- the NAs follow no simple pattern, except that the NAs are disjoint, and yes there are exactly 3 rows for each ID).

STEP ONE: get rid of rows that have only NA values.

On the surface this looked simple, but I've run across some problems.

complete.cases(DF) returns all FALSE, so I can't really use this to remove the rows with all NAs, as in DF[complete.cases(DF),]. This is because all rows contain at least one NA.

Since NAs want to propagate themselves, other schemes using is.na fail for the same reason.
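
For reference, one possible way to do step one in base R is to count the NAs in each row rather than test for complete rows. This is only a sketch under assumptions: it rebuilds the example data as DF, and the helper names value_cols, all_na and DF_step1 are made up for illustration.

```r
# Rebuild the example data from the question (DF is the name used above).
DF <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2),
  Col1 = c(NA, 5, NA, NA, 25, NA),
  Col2 = c(NA, 10, NA, NA, 30, NA),
  Col3 = c(NA, NA, 15, NA, NA, 35),
  Col4 = c(NA, NA, 20, NA, NA, 40)
)

# Every column except the grouping key (hypothetical helper name).
value_cols <- setdiff(names(DF), "ID")

# is.na() on a data frame gives a logical matrix; rowSums() counts NAs per row.
# A row is "all NA" when that count equals the number of value columns.
all_na <- rowSums(is.na(DF[value_cols])) == length(value_cols)

# Keep only the rows that carry at least one real value.
DF_step1 <- DF[!all_na, ]
```

The ID column is left out of the count on purpose, since it is never NA and would otherwise keep every row from qualifying as all-NA.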

STEP TWO: collapse the remaining two rows into one.

I am thinking about using something like aggregate to pull this off, but there has got to be an easier way than this, which doesn't work at all.
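
As an illustration of the aggregate idea, here is a minimal sketch, not the code originally attempted. It assumes the example data above and a hypothetical helper first_non_na that picks the first non-missing value in each group. It uses the data frame method of aggregate, because the formula method drops rows containing NA by default (na.action = na.omit), which would discard every row here.

```r
# Rebuild the example data from the question.
DF <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2),
  Col1 = c(NA, 5, NA, NA, 25, NA),
  Col2 = c(NA, 10, NA, NA, 30, NA),
  Col3 = c(NA, NA, 15, NA, NA, 35),
  Col4 = c(NA, NA, 20, NA, NA, 40)
)

value_cols <- setdiff(names(DF), "ID")

# Hypothetical helper: first non-missing value of a vector
# (returns NA if the whole group is missing).
first_non_na <- function(x) x[!is.na(x)][1]

# Data frame method of aggregate(): group the value columns by ID and
# reduce each column within each group to its first non-NA value.
collapsed <- aggregate(DF[value_cols], by = DF["ID"], FUN = first_non_na)
```

Because the helper skips NAs itself, this does both steps at once and the all-NA rows do not need to be removed first.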

Thanks for any advice.

asked Jan 20 '15 01:01 by tumultous_rooster



1 Answer

Here's a data.table approach that applies na.omit() to each column, grouped by ID.

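library(data.table)
# Convert DF (the data frame from the question) to a data.table by reference,
# then drop the NAs from every non-grouping column within each ID.
setDT(DF)[, lapply(.SD, na.omit), by = ID]
#    ID Col1 Col2 Col3 Col4
# 1:  1    5   10   15   20
# 2:  2   25   30   35   40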
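
Dropping the NAs per column yields exactly one row per ID only when each column holds exactly one non-missing value within each ID, which is what the question describes. A small base R sketch for checking that assumption before collapsing (the names value_cols, counts and one_value_each are hypothetical, and the data are the example from the question):

```r
# Rebuild the example data from the question.
DF <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2),
  Col1 = c(NA, 5, NA, NA, 25, NA),
  Col2 = c(NA, 10, NA, NA, 30, NA),
  Col3 = c(NA, NA, 15, NA, NA, 35),
  Col4 = c(NA, NA, 20, NA, NA, 40)
)

value_cols <- setdiff(names(DF), "ID")

# Count the non-missing values per column within each ID.
counts <- aggregate(DF[value_cols], by = DF["ID"],
                    FUN = function(x) sum(!is.na(x)))

# TRUE when every ID has exactly one non-NA value in every column.
one_value_each <- all(counts[value_cols] == 1)
```

If one_value_each is FALSE for the real data, the rows of counts that differ from 1 show which IDs and columns need a closer look.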
answered Sep 20 '22 17:09 by Rich Scriven