
Behavior of <- NULL on lists versus data.frames for removing data

Many R users eventually figure out lots of ways to remove elements from their data. One way is to use NULL, particularly when you want to do something like drop a column from a data.frame or drop an element from a list.

Eventually, a user comes across a situation where they want to drop several columns from a data.frame at once, and they hit upon <- list(NULL) as the solution (since using <- NULL will result in an error).

A data.frame is a special type of list, so it wouldn't be too tough to imagine that the approaches for removing items from a list should be the same as removing columns from a data.frame. However, they produce different results, as can be seen in the example below.

## Make some small data--two data.frames and two lists
cars1 <- cars2 <- head(mtcars)[1:4]
cars3 <- cars4 <- as.list(cars2)

## Demonstration that the `list(NULL)` approach works
cars1[c("mpg", "cyl")] <- list(NULL)
cars1
#                   disp  hp
# Mazda RX4          160 110
# Mazda RX4 Wag      160 110
# Datsun 710         108  93
# Hornet 4 Drive     258 110
# Hornet Sportabout  360 175
# Valiant            225 105

## Demonstration that simply using `NULL` does not work
cars2[c("mpg", "cyl")] <- NULL
# Error in `[<-.data.frame`(`*tmp*`, c("mpg", "cyl"), value = NULL) : 
#   replacement has 0 items, need 12

Now apply the same idea to a list and compare the behavior.

## Does not fully drop the items, but sets them to `NULL`
cars3[c("mpg", "cyl")] <- list(NULL)
cars3
# $mpg
# NULL
# 
# $cyl
# NULL
# 
# $disp
# [1] 160 160 108 258 360 225
# 
# $hp
# [1] 110 110  93 110 175 105

## *Does* drop the `list` items while this would
##   have produced an error with a `data.frame`
cars4[c("mpg", "cyl")] <- NULL
cars4
# $disp
# [1] 160 160 108 258 360 225
# 
# $hp
# [1] 110 110  93 110 175 105

My main questions are: if a data.frame is a list, why does it behave so differently in this scenario? Is there a foolproof way of knowing when an element will be dropped, when the assignment will produce an error, and when the element will simply be given a NULL value? Or do we depend on trial and error for this?

asked Oct 17 '13 18:10 by A5C1D2H2I1M1N2O1R2T1


1 Answer

DISCLAIMER : This is a relatively long answer, not very clear, and not very interesting, so feel free to skip it or to only read the (sort of) conclusion.

I've tried a bit of tracing on [<-.data.frame, as suggested by Ari B. Friedman. Debugging starts on line 162 of the function, where there is a test to determine if value (the replacement value argument) is not a list.

Case 1 : value is not a list

Then it is treated as a vector. Matrices and arrays are treated as a single vector, as the help page says:

Note that when the replacement value is an array (including a matrix) it is not treated as a series of columns (as 'data.frame' and 'as.data.frame' do) but inserted as a single column.

If only one column of the data frame is selected in the LHS, then the only constraint is that the number of rows to be replaced must be equal to or a multiple of length(value). If this is the case, value is recycled with rep if necessary and converted to a list. If length(value)==0, there is no recycling (as it is impossible), and value is just converted to a list.
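As a rough illustration of the single-column rule described above, here is a minimal sketch. The data frame `df` is just a copy of the question's sample data, and the results shown are what I would expect from the rule as described rather than from a reading of every code path.

```r
# Small data frame with 6 rows, as in the question
df <- head(mtcars)[1:4]

# Length 2 divides the 6 rows, so the value is recycled
df[, "mpg"] <- 1:2
df$mpg
# [1] 1 2 1 2 1 2

# Length 4 does not divide the 6 rows, so the assignment is rejected
res <- tryCatch({
  df[, "cyl"] <- 1:4
  "ok"
}, error = function(e) "error")
res
# [1] "error"
```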

If several columns of the data frame are selected in the LHS, then the constraint is a bit more complex: the total number of elements to be replaced, i.e. the number of rows * the number of columns, must be equal to or a multiple of length(value).

The exact test is the following :

(m < n * p && (m == 0L || (n * p)%%m))

Where n is the number of rows, p the number of columns, and m the length of value. If the condition is FALSE, then value is converted into an n x p matrix (and thus recycled if necessary), and the matrix is split by columns into a list.
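To make the test easier to play with, here is a small sketch that mirrors it outside of `[<-.data.frame`. The helper name `replacement_rejected` is hypothetical (it is not part of R); it only restates the condition quoted above, and the second half imitates the matrix-then-split step for an accepted value.

```r
# Hypothetical helper restating the quoted test:
# TRUE means the assignment would be rejected
replacement_rejected <- function(m, n, p) {
  m < n * p && (m == 0L || (n * p) %% m != 0)
}

replacement_rejected(0, 6, 2)   # TRUE:  length 0 is always rejected
replacement_rejected(5, 6, 2)   # TRUE:  12 is not a multiple of 5
replacement_rejected(3, 6, 2)   # FALSE: 12 is a multiple of 3
replacement_rejected(12, 6, 2)  # FALSE: exact fit

# For an accepted value, imitate the conversion to an n x p matrix
# followed by the split into one list element per column
value <- 1:3
mat <- matrix(value, 6, 2)
cols <- split(c(mat), col(mat))
cols[["1"]]
# [1] 1 2 3 1 2 3
```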

If value is NULL, then the condition is TRUE because m == 0, and the function stops with an error. Note that the problem occurs for every value of length 0. For example,

cars1[,c("mpg")] <- numeric(0)

works, whereas :

cars1[,c("mpg","disp")] <- numeric(0)

fails in the same way as cars1[,c("mpg","disp")] <- NULL

Case 2 : value is a list

If value is a list, then it is used to replace several columns at the same time. For example :

cars1[,c("mpg","disp")] <- list(1,2)

will replace cars1$mpg with a vector of 1s, and cars1$disp with a vector of 2s.
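A quick way to check this on a fresh copy of the question's data (the name `df` is only used here for illustration):

```r
# Fresh copy of the sample data: 6 rows, 4 columns
df <- head(mtcars)[1:4]

# Each list element replaces one column and is recycled over the rows
df[, c("mpg", "disp")] <- list(1, 2)

unique(df$mpg)
# [1] 1
unique(df$disp)
# [1] 2
```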

There is a sort of "double recycling" which happens here :

  • first, the length of the value list must be less than or equal to the number of columns to be replaced. If it is less, the list is recycled in the usual way.
  • second, for each element of the value list, the number of rows to be replaced must be equal to or a multiple of its length, unless the element is longer than the number of rows. If it is shorter, the element is recycled to match the number of rows. If it is longer, a warning is displayed.
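The two levels of recycling can be seen in a small sketch like the following, again on a copy of the question's data named `df` for illustration. It only covers the case where both recyclings succeed.

```r
df <- head(mtcars)[1:4]

# One list element for two columns: the list is recycled across columns,
# then its element of length 2 is recycled over the 6 rows
df[, c("mpg", "cyl")] <- list(1:2)

df$mpg
# [1] 1 2 1 2 1 2
df$cyl
# [1] 1 2 1 2 1 2
```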

When the value on the RHS is list(NULL), nothing really happens during recycling, as recycling NULL is impossible (rep(NULL, 10) is always NULL). But the code continues, and in the end each column to be replaced is assigned NULL, i.e. it is removed.
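Both points are easy to verify in a short sketch (`df` is an illustrative copy of the question's data):

```r
# Recycling NULL yields NULL
is.null(rep(NULL, 10))
# [1] TRUE

# Assigning list(NULL) to several columns removes them
df <- head(mtcars)[1:4]
df[c("mpg", "cyl")] <- list(NULL)
names(df)
# [1] "disp" "hp"
```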

Summary and (sort of) conclusion

data.frame and list behave differently because of the specific constraint on data frames, where every column must have the same length. Removing several columns by assigning NULL fails not because of the NULL value itself, but because NULL has length 0. The error comes from a test which verifies that the number of elements to be replaced (number of rows * number of columns) is a multiple of the length of the assigned value, and which rejects any value of length 0.

Handling the case of value=NULL for multiple columns doesn't seem difficult (it would take about four lines of simple code), but it requires treating NULL as a special case. I can't tell whether it is left unhandled because it would break the logic of the function's implementation, or because it would have side effects I am not aware of.

answered Sep 29 '22 14:09 by juba