Subsetting R data frame results in mysterious NA rows

Tags: r, na, reshape, subset

I've been encountering what I think is a bug. It's not a big deal, but I'm curious if anyone else has seen this. Unfortunately, my data is confidential, so I have to make up an example, and it's not going to be very helpful.

When subsetting my data, I occasionally get mysterious NA rows that aren't in my original data frame. Even the row names are NA. For example:

    example <- data.frame("var1"=c("A", "B", "A"), "var2"=c("X", "Y", "Z"))
    example
      var1 var2
    1    A    X
    2    B    Y
    3    A    Z

then I run:

    example[example$var1=="A",]
       var1 var2
    1     A    X
    3     A    Z
    NA <NA> <NA>

Of course, the example above does not actually give you this mysterious NA row; I am adding it here to illustrate the problem I'm having with my data.

Maybe it has to do with the fact that I'm importing my original data set using the read.xlsx function and then running a wide-to-long reshape before subsetting.

Thanks

200

asked Jan 10 '13 15:01

chrisg

2 Answers

Wrap the condition in which:

df[which(df$number1 < df$number2), ]

How it works:

which() returns the row numbers where the condition is TRUE, silently dropping any rows where it is FALSE or NA, and the data frame is then subset on those rows.

Say that:

which(df$number1 < df$number2)

returns row numbers 1, 2, 3, 4 and 5.

As such, writing:

df[which(df$number1 < df$number2), ]

is the same as writing:

df[c(1, 2, 3, 4, 5), ]

Or an even simpler version is:

df[1:5, ]
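To see the difference concretely, here is a minimal sketch that reuses the sample df from the other answer in this thread. The names cond, idx, res_logical and res_which are mine, chosen only for illustration:

```r
# Sample data copied from the other answer in this thread
df <- data.frame(name = LETTERS[1:10], number1 = 1:10, number2 = c(10:3, NA, NA))

# The comparison is NA wherever number2 is NA (rows 9 and 10)
cond <- df$number1 < df$number2
print(cond)

# which() keeps only the positions that are TRUE; FALSE and NA are dropped
idx <- which(cond)
print(idx)

# Indexing with the raw logical vector yields one all-NA row per NA in cond
res_logical <- df[cond, ]
# Indexing with the positions from which() yields only the matching rows
res_which <- df[idx, ]

print(nrow(res_logical))  # 7: five matches plus two all-NA rows
print(nrow(res_which))    # 5: the matches only
```

One thing to keep in mind: because which() discards the NAs without any warning, it also hides the fact that the condition could not be evaluated for some rows.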

153

answered Oct 07 '22 08:10

c-urchin

I see this was already answered by the OP, but since his comment is buried deep within the comment section, here's my attempt to fix this issue (at least with my data, which was behaving the same way).

First of all, some sample data:

    > df <- data.frame(name = LETTERS[1:10], number1 = 1:10, number2 = c(10:3, NA, NA))
    > df
       name number1 number2
    1     A       1      10
    2     B       2       9
    3     C       3       8
    4     D       4       7
    5     E       5       6
    6     F       6       5
    7     G       7       4
    8     H       8       3
    9     I       9      NA
    10    J      10      NA

Now for a simple filter:

    > df[df$number1 < df$number2, ]
         name number1 number2
    1       A       1      10
    2       B       2       9
    3       C       3       8
    4       D       4       7
    5       E       5       6
    NA   <NA>      NA      NA
    NA.1 <NA>      NA      NA

The problem is that the NAs in number2 make the comparison df$number1 < df$number2 evaluate to NA for rows 9 and 10, and R returns an all-NA row for every NA in a logical row index. The result keeps all three columns but gains two rows that are not in the original data. Here's my fix, which requires knowing which column contains the NAs:

    > df[df$number1 < df$number2 & !is.na(df$number2), ]
      name number1 number2
    1    A       1      10
    2    B       2       9
    3    C       3       8
    4    D       4       7
    5    E       5       6
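If you don't know in advance which column holds the NAs, one option is to test the condition itself for NA rather than a specific column. The sketch below is only an illustration on the same sample data; cond, res_explicit and res_subset are names I made up. It also shows base R's subset(), which takes missing values in the condition as FALSE:

```r
# Same sample data as above
df <- data.frame(name = LETTERS[1:10], number1 = 1:10, number2 = c(10:3, NA, NA))

# Evaluate the condition once, then treat NA as "no match",
# whichever column the NA came from
cond <- df$number1 < df$number2
res_explicit <- df[!is.na(cond) & cond, ]

# subset() evaluates the expression inside df and drops rows
# where the condition is NA
res_subset <- subset(df, number1 < number2)

print(res_explicit)
print(res_subset)
```

Both calls should return rows 1 to 5 only. subset() is convenient interactively, though its own help page recommends plain [ indexing for programmatic use.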

answered Oct 07 '22 10:10

Waldir Leoncio
