Many R users eventually figure out lots of ways to remove elements from their data. One way is to use NULL, particularly when you want to do something like drop a column from a data.frame or drop an element from a list.
Eventually, a user comes across a situation where they want to drop several columns from a data.frame at once, and they hit upon <- list(NULL) as the solution (since using <- NULL will result in an error).
A data.frame is a special type of list, so it wouldn't be too tough to imagine that the approaches for removing items from a list should be the same as those for removing columns from a data.frame. However, they produce different results, as can be seen in the example below.
## Make some small data--two data.frames and two lists
cars1 <- cars2 <- head(mtcars)[1:4]
cars3 <- cars4 <- as.list(cars2)
## Demonstration that the `list(NULL)` approach works
cars1[c("mpg", "cyl")] <- list(NULL)
cars1
#                   disp  hp
# Mazda RX4          160 110
# Mazda RX4 Wag      160 110
# Datsun 710         108  93
# Hornet 4 Drive     258 110
# Hornet Sportabout  360 175
# Valiant            225 105
## Demonstration that simply using `NULL` does not work
cars2[c("mpg", "cyl")] <- NULL
# Error in `[<-.data.frame`(`*tmp*`, c("mpg", "cyl"), value = NULL) :
# replacement has 0 items, need 12
Switch to applying the same concept to a list, and compare the difference in behavior.
## Does not fully drop the items, but sets them to `NULL`
cars3[c("mpg", "cyl")] <- list(NULL)
cars3
# $mpg
# NULL
#
# $cyl
# NULL
#
# $disp
# [1] 160 160 108 258 360 225
#
# $hp
# [1] 110 110 93 110 175 105
## *Does* drop the `list` items while this would
## have produced an error with a `data.frame`
cars4[c("mpg", "cyl")] <- NULL
cars4
# $disp
# [1] 160 160 108 258 360 225
#
# $hp
# [1] 110 110 93 110 175 105
The main questions I have are: if a data.frame is a list, why does it behave so differently in this scenario? Is there a foolproof way of knowing when an element will be dropped, when it will produce an error, and when it will simply be given a NULL value? Or do we depend on trial and error for this?
DISCLAIMER: This is a relatively long answer, not very clear, and not very interesting, so feel free to skip it or to only read the (sort of) conclusion.
I've tried a bit of tracing on [<-.data.frame, as suggested by Ari B. Friedman. Debugging starts on line 162 of the function, where there is a test to determine if value (the replacement value argument) is not a list.
value is not a list
In this case it is treated as a vector. Matrices and arrays are treated as a single vector, as the help page says:
Note that when the replacement value is an array (including a matrix) it is not treated as a series of columns (as 'data.frame' and 'as.data.frame' do) but inserted as a single column.
If only one column of the data frame is selected in the LHS, then the only constraint is that the number of rows to be replaced must be equal to or a multiple of length(value). If this is the case, value is recycled with rep if necessary and converted to a list. If length(value)==0, there is no recycling (as it is impossible), and value is just converted to a list.
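The single-column rule can be checked with a small experiment. This is only a sketch on a fresh copy of the data from the question; the names df and res are made up for the illustration.

```r
# Single-column replacement with a non-list value (base R only).
df <- head(mtcars)[1:4]          # 6 rows, columns mpg, cyl, disp, hp

# length(value) is 2 and 6 is a multiple of 2, so the value is recycled
df[, "mpg"] <- c(1, 2)
df$mpg
# [1] 1 2 1 2 1 2

# length(value) is 4 and 6 is not a multiple of 4, so an error is raised
res <- tryCatch({
  df[, "cyl"] <- 1:4
  "ok"
}, error = function(e) "error")
res
# [1] "error"
```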
If several columns of the data frame are selected in the LHS, then the constraint is a bit more complex: length(value) must not be zero, and if it is smaller than the total number of elements to be replaced (i.e. the number of rows * the number of columns), that total must be a multiple of length(value).
The exact test is the following:
(m < n * p && (m == 0L || (n * p)%%m))
where n is the number of rows, p the number of columns, and m the length of value. If the condition is FALSE, then value is converted into an n x p matrix (thus recycled if necessary) and the matrix is split by columns into a list.
If value is NULL, then the condition is TRUE as m==0, and the function is stopped.
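To make the test easier to play with, here is a minimal sketch that wraps the same condition in a helper. The function would_stop is hypothetical (it is not part of base R) and only mirrors the expression quoted above; df and msg are likewise just names for this illustration.

```r
# Hypothetical helper mirroring the length test from `[<-.data.frame`.
# n: rows to replace, p: columns to replace, m: length(value)
would_stop <- function(n, p, m) {
  m < n * p && (m == 0L || (n * p) %% m != 0L)
}

would_stop(6L, 2L, 0L)   # TRUE: NULL or any zero-length value is rejected
would_stop(6L, 2L, 3L)   # FALSE: 12 is a multiple of 3, value is recycled
would_stop(6L, 2L, 5L)   # TRUE: 12 is not a multiple of 5

# The real function agrees for the NULL case (6 rows * 2 columns = 12)
df <- head(mtcars)[1:4]
msg <- tryCatch({
  df[c("mpg", "cyl")] <- NULL
  "no error"
}, error = function(e) conditionMessage(e))
msg
```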
Note that the problem occurs for every value of length 0. For example,
cars1[,c("mpg")] <- numeric(0)
works, whereas:
cars1[,c("mpg","disp")] <- numeric(0)
fails in the same way as cars1[,c("mpg","disp")] <- NULL.
value is a list
In this case it is used to replace several columns at the same time. For example:
cars1[,c("mpg","disp")] <- list(1,2)
will replace cars1$mpg with a vector of 1s, and cars1$disp with a vector of 2s.
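The list case can be tried on a fresh copy of the data. This is a sketch only; df is a name chosen for the illustration, and the second assignment shows a list shorter than the number of selected columns being reused for each of them.

```r
df <- head(mtcars)[1:4]                 # 6 rows, columns mpg, cyl, disp, hp

# One list element per column; each element is recycled down the rows
df[, c("mpg", "disp")] <- list(1, 2)
unique(df$mpg)    # 1
unique(df$disp)   # 2

# A list with fewer elements than selected columns is recycled across columns
df[, c("cyl", "hp")] <- list(0)
unique(df$cyl)    # 0
unique(df$hp)     # 0
```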
There is a sort of "double recycling" which happens here:
First, the length of the value list must be less than or equal to the number of columns to be replaced. If it is less, then a classic recycling is done.
Second, for each element of the value list, its length must be equal to, greater than, or a divisor of the number of rows to be replaced. If it is less, another recycling is done for each list element to match the number of rows. If it is more, a warning is displayed.
When the value in RHS is list(NULL), nothing really happens, as recycling is impossible (rep(NULL, 10) is always NULL). But the code continues and in the end each column to be replaced is assigned NULL, i.e. is removed.
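A rough way to picture that last step is sketched below. It is a simplified imitation, not the actual body of `[<-.data.frame`: the data frame is treated as a plain list of columns, and each selected column is assigned the NULL element in turn. The names cols and value are made up for the illustration.

```r
# Recycling NULL has no effect: the result is still NULL
is.null(rep(NULL, 10))
# [1] TRUE

# Simplified imitation of the final assignment loop
df <- head(mtcars)[1:4]
cols <- unclass(df)            # plain list of columns
value <- list(NULL)            # the replacement from the RHS
for (nm in c("cyl", "mpg")) {
  cols[[nm]] <- value[[1]]     # assigning NULL with [[<- drops the element
}
names(cols)
# [1] "disp" "hp"
```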
data.frame and list behave differently because of the specific constraint on data frames, where each element must be of the same length. Removing several columns by assigning NULL fails not because of the NULL value by itself, but because NULL is of length 0. The error comes from a test which verifies that the number of elements to be replaced (number of rows * number of columns) is a multiple of the length of the assigned value.
Handling the case of value=NULL for multiple columns doesn't seem difficult (by adding about four lines of simple code), but it requires treating NULL as a special case. I'm not able to determine if it is not handled because it would break the logic of the function implementation, or because it would have side effects I'm not aware of.
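For what it's worth, the special case can be imitated from the outside without touching base R. The wrapper below is purely hypothetical (assign_cols is a made-up name, and this is not how `[<-.data.frame` is written); it only shows the idea of turning a bare NULL into list(NULL) before assigning, and it is meant for data frames only, since on a plain list list(NULL) sets the elements to NULL instead of dropping them.

```r
# Hypothetical wrapper: treat a bare NULL as "drop these columns"
assign_cols <- function(df, cols, value) {
  if (is.null(value)) value <- list(NULL)   # the special case
  df[cols] <- value
  df
}

dropped <- assign_cols(head(mtcars)[1:4], c("mpg", "cyl"), NULL)
names(dropped)
# [1] "disp" "hp"

# Other values go through the usual recycling rules
zeroed <- assign_cols(head(mtcars)[1:4], c("mpg", "cyl"), 0)
unique(zeroed$mpg)   # 0
```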