Row numbers differ (NA vs 1) when adding first row to empty data.frame

Tags: dataframe, r

I'd like to understand why, of these two methods for adding rows to an empty data.frame, the first results in an NA row number being assigned to the first row only:

Method 1:

df <- data.frame(Number=numeric(), Text=character(), stringsAsFactors = FALSE)
df[1,]$Number <- 123456
df[1,]$Text <- "abcdef"
df[2,]$Number <- 456789
df[2,]$Text <- "abcdef"

Output 1:

> df
   Number   Text
NA 123456 abcdef
2  456789 abcdef

Method 2:

df <- data.frame(Number=numeric(), Text=character(), stringsAsFactors = FALSE)
df[1,1] <- 123456
df[1,2] <- "abcdef"
df[2,1] <- 456789
df[2,2] <- "abcdef"

Output 2:

> df
  Number   Text
1 123456 abcdef
2 456789 abcdef

The only difference I see is that the first method accesses the data.frame using the column name instead of the column number. I don't see why this results in an NA row number being assigned to the first observation only, since the row numbers work as expected from the second row onwards.

asked Jul 17 '18 17:07 by NewUser


1 Answer

Well, the most important part of this answer is that code like this should be avoided. It is very inefficient to add data to a data.frame in R row by row (see Circle 2 of The R Inferno). There are almost always better ways to do this, depending on what exactly you are doing.
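
As a minimal sketch of what such a better way might look like (assuming the values are either known up front or can be collected before the data.frame is built, which may not match your situation), you can create the data.frame in one call instead of growing it. The names numbers, texts, rows and df2 are only for illustration.

```r
# Option 1: build whole columns first, then create the data.frame in one call.
numbers <- c(123456, 456789)
texts   <- c("abcdef", "abcdef")
df <- data.frame(Number = numbers, Text = texts, stringsAsFactors = FALSE)

# Option 2: if rows arrive one at a time, collect them in a list and
# bind them together once at the end instead of growing df in a loop.
rows <- list(
  data.frame(Number = 123456, Text = "abcdef", stringsAsFactors = FALSE),
  data.frame(Number = 456789, Text = "abcdef", stringsAsFactors = FALSE)
)
df2 <- do.call(rbind, rows)

rownames(df)   # "1" "2"
```

Either way the row names are assigned once for the whole data.frame, so the NA label from the question never comes up.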

But to get to what's going on here: all of this comes down to the $<-.data.frame, [.data.frame, and [<-.data.frame functions. In the first case, with

df[1,]$Number <- 123456

you are doing the subset first, which calls [.data.frame. When you ask for a row of a data.frame that doesn't exist, you get NA values for everything (including the row name). So now you have a one-row data.frame with NA values in the columns and NA as the row name. Next, $<-.data.frame is called to update just the Number column. It doesn't update the row names. This new value then gets passed to [<-.data.frame to merge it back into the data.frame. When this command runs, it checks to make sure that there are no duplicated row names. For the first row, since there's only one row and it has the name NA, that name is kept. However, when there are duplicate names, the function replaces those values with the row indexes. That's why you get an NA for the first row, but when it tries to add the next row, it tries NA again, sees that it's a duplicate, and has to choose a new name. (See what happens when you try df[1:2,]$Number <- 123456 then df[3,]$Number <- 456789)
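
The sequence described above can be sketched by doing the three calls by hand. This is only an approximation of what R does internally for df[1,]$Number <- 123456, and the variable tmp is a stand-in introduced here for the intermediate value.

```r
df <- data.frame(Number = numeric(), Text = character(), stringsAsFactors = FALSE)

# Step 1: extraction with `[.data.frame`. Row 1 does not exist yet,
# so the result is a single row of NA values.
tmp <- df[1, ]
rownames(tmp)    # the row name is the string "NA"

# Step 2: `$<-.data.frame` changes only the column, not the row name.
tmp$Number <- 123456

# Step 3: `[<-.data.frame` merges the modified row back into df.
df[1, ] <- tmp
rownames(df)     # row 1 is labelled NA, matching Output 1 in the question
```

Inspecting rownames(tmp) after step 1 shows where the NA label comes from: it is already attached to the extracted row before any assignment happens.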

On the other hand, when you do

df[1,1] <- 123456

That doesn't do the subsetting first to create a row with a missing row name. You go right to assignment, skipping [.data.frame and $<-.data.frame. In this case, it doesn't have to merge in a new row with an NA row name; it can create the row right away and assign a row name. This is just a property of calling the assignment operator without having to do the extraction first. You can put the debugger on with debug(`[<-.data.frame`) to see exactly how that happens.

So the first method is basically doing three steps: 1) extract df[1,], 2) change the value of the Number column, then 3) merge that new value back into df[1,]. The second method skips the first two steps and just merges values directly into df[1,]. The real difference is in how each of those functions chooses row names for rows that don't exist yet.
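
To see the two behaviours side by side, here is a small sketch that runs both methods from the question on fresh empty data.frames. The helper make_empty and the names direct and chained are made up for this example. The last step is an addition not covered above: assigning NULL to the row names is one way to get back to automatic numbering if the NA label has already appeared.

```r
# Hypothetical helper: returns the empty data.frame used in the question.
make_empty <- function() {
  data.frame(Number = numeric(), Text = character(), stringsAsFactors = FALSE)
}

# Method 2 from the question: direct assignment, no extraction step.
direct <- make_empty()
direct[1, 1] <- 123456
direct[1, 2] <- "abcdef"
rownames(direct)          # "1", as in Output 2

# Method 1 from the question: extract, modify, merge back.
chained <- make_empty()
chained[1, ]$Number <- 123456
chained[1, ]$Text <- "abcdef"
rownames(chained)         # row 1 is labelled NA, as in Output 1

# Assigning NULL discards the stored row names and restores 1, 2, ...
rownames(chained) <- NULL
rownames(chained)         # "1"
```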

answered Oct 16 '22 08:10 by MrFlick