I have one non-empty data frame, df1:
df1 <- structure(list(V1 = 1:4, V2 = 5:8), class = "data.frame", row.names = c(NA,
-4L))
> df1
V1 V2
1 1 5
2 2 6
3 3 7
4 4 8
and two empty data frames, df2.a and df2.b, i.e.,
df2.a <- structure(list(V1 = integer(0), V2 = integer(0), V3 = integer(0), V4 = integer(0)), row.names = integer(0), class = "data.frame")
df2.b <- structure(list(V1 = NULL, V2 = NULL, V3 = NULL, V4 = NULL), row.names = c(NA, 0L), class = "data.frame")
where df2.a and df2.b look almost identical when printed (the only difference shows up when using dput(df2.a) and dput(df2.b)):
> df2.a
[1] V1 V2 V3 V4
<0 rows> (or 0-length row.names)
> df2.b
[1] V1 V2 V3 V4
<0 rows> (or 0-length row.names)
However, when I try to merge df1 with df2.a or df2.b, something weird occurs:
> merge(df1,df2.a,all = TRUE)
V1 V2 V3 V4
1 1 5 NA NA
2 2 6 NA NA
3 3 7 NA NA
4 4 8 NA NA
> merge(df1,df2.b,all = TRUE)
V1 V2 V4
1 1 5 NA
2 2 6 NA
3 3 7 NA
4 4 8 NA
As you can see, V3 is dropped when merging df1 with df2.b, while the desired result is something like the output of merge(df1,df2.a,all = TRUE).
Can someone explain why this happens? I would also appreciate a workaround for this issue when using merge on df1 and df2.b.
This is a complex one. The misstep occurs in this line of base::merge:
y <- y[c(m$yi, if (all.x) rep.int(1L, nxx), if (all.y) m$y.alone),
-by.y, drop = FALSE]
When you pass df2.b as the y argument to merge, this line actually produces an invalid data frame, as you can see in the browser:
Browse[2]> y
#> V4
#> NA NULL
#> NA.1 <NA>
#> NA.2 <NA>
#> NA.3 <NA>
#> Warning message:
#> In format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x, :
#> corrupt data frame: columns will be truncated or padded with NAs
If we trace the logic through, we can see that we can reproduce the error outside the debugger by calling:
df2.b[c(1, 1, 1, 1), -c(1:2), drop = FALSE]
#> V4
#> NA NULL
#> NA.1 <NA>
#> NA.2 <NA>
#> NA.3 <NA>
#> Warning message:
#> In format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x, :
#> corrupt data frame: columns will be truncated or padded with NAs
Whereas we don't get this problem for df2.a:
df2.a[c(1, 1, 1, 1), -c(1:2), drop = FALSE]
#> V3 V4
#> NA NA NA
#> NA.1 NA NA
#> NA.2 NA NA
#> NA.3 NA NA
So why is this? Even though df2.a and df2.b look the same when you print them, they are not the same. An empty integer vector isn't quite the same as NULL. The main difference (the one that causes the problem here) is that indexing an empty integer vector past its end gives you a vector of NA values of the requested length, whereas indexing NULL always gives you NULL.
df2.a$V1[1:4]
#> [1] NA NA NA NA
df2.b$V1[1:4]
#> NULL
So I guess this is expected behaviour. The problem is that R allows NULL as a data frame column at all. I'm surprised this kind of thing doesn't happen more often.
I tracked down the cause of this issue and found that it arises in the following section of merge.data.frame:
y <- y[c(m$yi, if (all.x) rep.int(1L, nxx), if (all.y) m$y.alone),
-by.y, drop = FALSE]
To show the problem, try the following code:
df2.b[rep(1, 4), -(1:2), drop = FALSE]
# V4
# NA NULL
# NA.1 <NA>
# NA.2 <NA>
# NA.3 <NA>
# Warning message:
# In format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x, :
# corrupt data frame: columns will be truncated or padded with NAs
df2.a[rep(1, 4), -(1:2), drop = FALSE]
# V3 V4
# NA   NA NA
# NA.1 NA NA
# NA.2 NA NA
# NA.3 NA NA
Therefore, this issue is caused by [.data.frame. A section of the source code of [.data.frame is:
for (j in seq_along(x)) {
xj <- xx[[sxx[j]]]
x[[j]] <- if (length(dim(xj)) != 2L){
xj[i]
}else{ xj[i, , drop = FALSE]}
}
Here, x is the resulting data.frame to be returned. At this point it has only the columns V3 and V4. xx is a copy of the input data.frame (df2.b in our case). The for-loop first assigns NULL to column 1 of x, which deletes V3. Next, the loop assigns NULL to column 2 of x. However, as V3 is gone, there is no column 2 any more, so x is not affected and only V4 remains. That's why we get the unexpected result.
If we convert df1 and df2.b to data.table, merging them throws an error. It seems that data.table's merge method treats such cases more strictly. The error message helps us avoid getting unexpected results.
I'll try to provide an answer as complete as I can...
(When I posted the answer, I noticed I had joined the party too late :D I'll leave the answer anyway as, I hope, it provides another interesting point of view.)
Let's start by looking at the merge function, specifically the method that gets called here, which is merge.data.frame (an exported function of the base package).
If you debug merge.data.frame(df1,df2.b,all = TRUE), you'll see at line 124 that this gets called:
y <- y[c(m$yi, if (all.x) rep.int(1L, nxx), if (all.y) m$y.alone),
-by.y, drop = FALSE]
y is identical to df2.b.
Since m$yi is equal to integer(0), all.x is TRUE, and all.y is FALSE, this can be simplified to:
y[rep.int(1L, nxx), -by.y, drop = FALSE]
The output of it is:
V4
NA NULL
NA.1 <NA>
NA.2 <NA>
NA.3 <NA>
Warning message:
In format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x, :
corrupt data frame: columns will be truncated or padded with NAs
So this is the behind-the-scenes "problem" that merge tells us nothing about.
Let's dig into it.
First of all, the actual output is not quite that; the default print.data.frame method is tricking our eyes.
The output of
unclass(y[rep.int(1L, nxx), -by.y, drop = FALSE])
is
$V4
NULL
attr(,"row.names")
[1] "NA" "NA.1" "NA.2" "NA.3"
NULL doesn't get duplicated, which makes sense since you can't build a vector out of two NULLs:
identical(c(NULL, NULL), NULL)
#> TRUE
As the warning says, the data.frame is corrupt and the printing may be faulty (which it is!).
That's because the data.frame was created in a tricky way with structure() instead of data.frame() or as.data.frame(), which wouldn't have led you to that structure.
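To illustrate that point, here is a small sketch of how data.frame() treats NULL arguments compared with zero-length vectors. The variable names are chosen only for this illustration.

```r
# data.frame() silently drops NULL arguments, so it cannot produce NULL columns.
from_null <- data.frame(V1 = NULL, V2 = NULL)
ncol(from_null)   # 0

# A well-formed empty frame uses zero-length vectors instead.
well_formed <- data.frame(V1 = integer(0), V2 = integer(0))
dim(well_formed)  # 0 2
```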
So this is the story of how you get to one column only.
The question is why?
For that we need to look at the function [.data.frame.
Let's observe some behaviors first.
> df2.b[1,]
V2 V4
NA NULL NULL
> df2.b[,1]
NULL
> df2.b[,1, drop = FALSE]
[1] V1
<0 rows> (or 0-length row.names)
> df2.b[1,1]
NULL
> df2.b[1,1, drop = FALSE]
data frame with 0 columns and 1 row
> df2.b[1,1:2]
V2
NA NULL
> df2.b[c(1,1),1:2]
V2
NA NULL
NA.1 <NA>
Warning message:
In format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x, :
corrupt data frame: columns will be truncated or padded with NAs
The last three look pretty unexpected. The last one in particular is our case, the same one we saw before.
If you try to debug:
debugonce(base:::`[.data.frame`)
df2.b[c(1,1),1:2]
you'll find at line 109 this code:
for (j in seq_along(x)) {
xj <- xx[[sxx[j]]]
x[[j]] <- if (length(dim(xj)) != 2L)
xj[i]
else xj[i, , drop = FALSE]
}
More readable:
for (j in seq_along(x)) {
xj <- xx[[sxx[j]]]
x[[j]] <- if (length(dim(xj)) != 2L) xj[i] else xj[i, , drop = FALSE]
}
At that point, the variables are as follows:
x = list(V1 = NULL, V2 = NULL)
xx = df2.b
sxx = 1:2
i = 1:2
If you run the for loop with those variables you will get that x is:
> x
$V2
NULL
Looks like we found the source of the disappearing column.
Now, where exactly is the problem?
When j == 1, x[[j]] <- ... is equivalent to x$V1 <- NULL, which in R deletes the element V1 from a list. Therefore x becomes a list with only one element:
> x
$V2
NULL
When j == 2, x[[j]] doesn't exist anymore because the first iteration deleted the first item and only one is left. R therefore tries to assign a new second item, but since you can't assign NULL as an item this way (x[[2]] <- NULL), x does not change.
Therefore you have only one column.
The reason merge behaves weirdly is that you created your data frame in an improper manner.
merge doesn't tell you that the data frame is actually corrupt, and it carries on even when it shouldn't.
Ultimately, it's [ and its way of handling subsetting that causes the final loss of one of the columns.
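Since merge gives no warning, one option is to check for NULL columns yourself before merging. The sketch below assumes a made-up helper name, has_null_cols, which is not part of base R.

```r
df2.a <- data.frame(V1 = integer(0), V2 = integer(0),
                    V3 = integer(0), V4 = integer(0))
df2.b <- structure(list(V1 = NULL, V2 = NULL, V3 = NULL, V4 = NULL),
                   row.names = c(NA, 0L), class = "data.frame")

# Hypothetical helper: TRUE if any column of `df` is NULL.
has_null_cols <- function(df) {
  any(vapply(unclass(df), is.null, logical(1)))
}

has_null_cols(df2.a)  # FALSE
has_null_cols(df2.b)  # TRUE
```

A check like this could guard a merge call, for example with stopifnot(!has_null_cols(df2.b)), so the problem surfaces as an error rather than a silently dropped column.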
Honestly, just use dplyr::full_join(df1, df2.b). It takes nothing for granted, and it actually results in the error you would have expected from the beginning:
> dplyr::full_join(df1, df2.b)
Joining, by = c("V1", "V2")
Error: All columns in a tibble must be vectors.
x Column `V1` is NULL.
x Column `V2` is NULL.
x Column `V3` is NULL.
x Column `V4` is NULL.