subsetting a data.table using !=<some non-NA> excludes NA too

Tags:

r

data.table

I have a data.table with a column that has NAs. I want to drop rows where that column takes a particular value (which happens to be ""). However, my first attempt led me to lose the rows with NAs as well:

> a = c(1,"",NA)
> x <- data.table(a);x
    a
1:  1
2:   
3: NA
> y <- x[a!=""];y
   a
1: 1

After looking at ?`!=`, I found a one-liner that works, but it's a pain:

> z <- x[!sapply(a,function(x)identical(x,""))]; z
    a
1:  1
2: NA

Is there a better way to do this? Also, I see no good way of extending this to exclude multiple non-NA values. Here's a bad way:

>     drop_these <- function(these,where){
+         argh <- !sapply(where,
+             function(x)unlist(lapply(as.list(these),function(this)identical(x,this)))
+         )
+         if (is.matrix(argh)){argh <- apply(argh,2,all)}
+         return(argh)
+     }
>     x[drop_these("",a)]
    a
1:  1
2: NA
>     x[drop_these(c(1,""),a)]
    a
1: NA

I looked at ?J and tried things out with a data.frame, which seems to work differently, keeping NAs when subsetting:

> w <- data.frame(a,stringsAsFactors=F); w
     a
1    1
2     
3 <NA>
> d <- w[a!="",,drop=F]; d
      a
1     1
NA <NA>

asked Apr 25 '13 18:04

Frank

1 Answer

To provide a solution to your question:

Use %in%. It returns a logical vector that never contains NA: NA %in% "" is FALSE, whereas NA != "" is NA.

a %in% ""
# [1] FALSE  TRUE FALSE

x[!a %in% ""]
#     a
# 1:  1
# 2: NA
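
The same idea extends to dropping several values at once, which the question also asks about. What follows is a minimal sketch in base R on the plain vector rather than on the data.table; `drop_vals`, `keep` and `keep_single` are illustrative names and not part of data.table. It relies on `%in%` returning FALSE for NA, and it also shows an explicit `is.na()` alternative for the single-value case.

```r
# The column from the question; c(1, "", NA) is coerced to character.
a <- c(1, "", NA)

# Hypothetical name: the values to exclude.
drop_vals <- c("1", "")

# %in% gives FALSE for NA (NA is not listed in drop_vals),
# so the negation keeps the NA element.
keep <- !a %in% drop_vals
keep
# [1] FALSE FALSE  TRUE

# Explicit alternative for a single value: keep NAs on purpose.
keep_single <- is.na(a) | a != ""
keep_single
# [1]  TRUE FALSE  TRUE
```

With the data.table from the question, the same expressions should go straight into i, for example x[!a %in% drop_vals] or x[is.na(a) | a != ""].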

To find out why this is happening in `data.table`:

(as opposed to data.frame)

If you look at the data.table source code, in the file data.table.R under the function "[.data.table", there's a set of if-statements that check the i argument. One of them is:

if (!missing(i)) {
    # Part (1)
    isub = substitute(i)

    # Part (2)
    if (is.call(isub) && isub[[1L]] == as.name("!")) {
        notjoin = TRUE
        if (!missingnomatch) stop("not-join '!' prefix is present on i but nomatch is provided. Please remove nomatch.");
        nomatch = 0L
        isub = isub[[2L]]
    }

    .....
    # "isub" is being evaluated using "eval" to result in a logical vector

    # Part 3
    if (is.logical(i)) {
        # see DT[NA] thread re recycling of NA logical
        if (identical(i,NA)) i = NA_integer_  
        # avoids DT[!is.na(ColA) & !is.na(ColB) & ColA==ColB], just DT[ColA==ColB]
        else i[is.na(i)] = FALSE  
    }
    ....
}

To explain the discrepancy, I've pasted the important piece of code above and marked it in 3 parts.

First, why doesn't `x[a != ""]` work as expected (by the OP)?

Part 1 captures the i expression as an object of class call. The second condition of the if statement in part 2 is FALSE, because the outermost function of the call is != rather than !. The call is then evaluated to give c(TRUE, FALSE, NA). Finally, part 3 is executed, and NA is replaced with FALSE (the last line of the is.logical block), so the NA row is dropped along with the "" row.
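
The steps just described can be replayed in base R. This is only a sketch that mimics parts 1 to 3 of the quoted snippet on a plain vector; it assumes those are the only steps that matter and it is not data.table's actual code path.

```r
a <- c(1, "", NA)

# Part 1: capture the unevaluated expression, as substitute(i) would.
isub <- quote(a != "")
is.call(isub)
# [1] TRUE

# Part 2: the outermost function is `!=`, not `!`,
# so the not-join branch is skipped.
identical(isub[[1L]], as.name("!"))
# [1] FALSE

# Evaluate with `a` visible as if it were a column; NA != "" is NA.
i <- eval(isub, list(a = a))
i
# [1]  TRUE FALSE    NA

# Part 3: NA is treated as FALSE, so only element 1 is selected.
i[is.na(i)] <- FALSE
selected <- which(i)
selected
# [1] 1
```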

Why does `x[!(a == "")]` work as expected (by the OP)?

Part 1 returns a call once again. But this time the condition in part 2 evaluates to TRUE, and the block therefore sets:

1) notjoin = TRUE
2) isub = isub[[2L]]  # which is (a == ""), without the ! (exclamation)

That is where the magic happens. The negation has been stripped off for now, and what remains is still an object of class call. It is evaluated (using eval) to a logical vector: (a == "") evaluates to c(FALSE, TRUE, NA).

Next, part 3 checks is.logical and replaces NA with FALSE, giving c(FALSE, TRUE, FALSE). At some point later, which(c(F,T,F)) is executed, which returns 2 here. Because notjoin = TRUE (from part 2), seq_len(nrow(x))[-2], i.e. c(1,3), is returned. So x[!(a == "")] basically returns x[c(1,3)], which is the desired result. Here's the relevant code snippet:

if (notjoin) {
    if (bywithoutby || !is.integer(irows) || is.na(nomatch)) stop("Internal error: notjoin but bywithoutby or !integer or nomatch==NA")
    irows = irows[irows!=0L]
    # WHERE MAGIC HAPPENS (returns c(1,3))
    i = irows = if (length(irows)) seq_len(nrow(x))[-irows] else NULL  # NULL meaning all rows i.e. seq_len(nrow(x))
    # Doing this once here, helps speed later when repeatedly subsetting each column. R's [irows] would do this for each
    # column when irows contains negatives.
}
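
The not-join path can be replayed the same way. Again, this is a base R sketch of the quoted snippets only, under the assumption that nothing else in "[.data.table" affects the result; `rows_kept` is an illustrative name, and the behaviour of x[!(a == "")] itself may differ between data.table versions.

```r
a <- c(1, "", NA)

# Part 1: capture the unevaluated expression.
isub <- quote(!(a == ""))

# Part 2: the outermost function is `!`, so the not-join flag is set
# and the `!` is stripped off.
notjoin <- is.call(isub) && identical(isub[[1L]], as.name("!"))
isub <- isub[[2L]]               # now (a == "")

# Evaluate what is left; NA == "" is NA.
i <- eval(isub, list(a = a))     # FALSE TRUE NA

# Part 3: NA is treated as FALSE *before* the negation is applied.
i[is.na(i)] <- FALSE             # FALSE TRUE FALSE
irows <- which(i)                # 2

# Not-join: keep every row except the matched ones.
rows_kept <- seq_along(a)[-irows]
rows_kept
# [1] 1 3
```

Because NA is turned into FALSE before the rows are inverted, the NA element ends up among the kept rows, which is the opposite of what happens on the `!=` path.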

Given that, I think there are some inconsistencies in the syntax. If I manage to find time to formulate the problem, I'll write a post soon.

answered Sep 27 '22 19:09

Arun
