weird behavior when merging one non-empty data.frame with an empty one

Tags:

merge

dataframe

r

I have one non-empty data frame, df1:

df1 <- structure(list(V1 = 1:4, V2 = 5:8), class = "data.frame", row.names = c(NA, 
-4L))

> df1
  V1 V2
1  1  5
2  2  6
3  3  7
4  4  8

and two empty data frames df2.a and df2.b, i.e.,

df2.a <- structure(list(V1 = integer(0), V2 = integer(0), V3 = integer(0), V4 = integer(0)), row.names = integer(0), class = "data.frame")


df2.b <- structure(list(V1 = NULL, V2 = NULL, V3 = NULL, V4 = NULL), row.names = c(NA, 0L), class = "data.frame")

where df2.a and df2.b look almost identical (the only difference shows up when using dput(df2.a) and dput(df2.b))

> df2.a
[1] V1 V2 V3 V4
<0 rows> (or 0-length row.names)
> df2.b
[1] V1 V2 V3 V4
<0 rows> (or 0-length row.names)

However, when I try to merge df1 with df2.a or df2.b, something weird happens:

> merge(df1,df2.a,all = TRUE)
  V1 V2 V3 V4
1  1  5 NA NA
2  2  6 NA NA
3  3  7 NA NA
4  4  8 NA NA

> merge(df1,df2.b,all = TRUE)
  V1 V2 V4
1  1  5 NA
2  2  6 NA
3  3  7 NA
4  4  8 NA

As you can see, V3 is dropped when merging df1 with df2.b, whereas the desired result is something like the output of merge(df1,df2.a,all = TRUE).

Can someone explain this? I would also appreciate a workaround for this issue when using merge on df1 and df2.b.

asked Sep 22 '20 11:09

ThomasIsCoding

3 Answers

This is a complex one. The misstep occurs in this line of base::merge (more precisely, of its merge.data.frame method):

y <- y[c(m$yi, if (all.x) rep.int(1L, nxx), if (all.y) m$y.alone), 
            -by.y, drop = FALSE]

When you pass df2.b as the y argument to merge, this line actually produces an invalid data frame, as you can see in the browser:

Browse[2]> y
#>        V4
#> NA   NULL
#> NA.1 <NA>
#> NA.2 <NA>
#> NA.3 <NA>
#> Warning message:
#> In format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x,  :
#>  corrupt data frame: columns will be truncated or padded with NAs

Tracing the logic through, we can reproduce the problem outside the debugger by calling:

df2.b[c(1, 1, 1, 1), -c(1:2), drop = FALSE]
#>        V4
#> NA   NULL
#> NA.1 <NA>
#> NA.2 <NA>
#> NA.3 <NA>
#> Warning message:
#> In format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x,  :
#>  corrupt data frame: columns will be truncated or padded with NAs

By contrast, we don't get this problem for df2.a:

df2.a[c(1, 1, 1, 1), -c(1:2), drop = FALSE]
#>      V3 V4
#> NA   NA NA
#> NA.1 NA NA
#> NA.2 NA NA
#> NA.3 NA NA

So why is this? Even though df2.a and df2.b look the same when printed, they are not the same. An empty integer vector isn't quite the same as NULL. The main difference (the one that causes the problem here) is that indexing an empty vector beyond its length gives you a vector of NA values, one per requested index, whereas indexing NULL always gives you NULL.

df2.a$V1[1:4]
#> [1] NA NA NA NA

df2.b$V1[1:4]
#> NULL

So I guess this is expected behaviour. The problem is that R allows NULL as a dataframe column at all. I'm surprised this kind of thing doesn't happen more often.
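One possible workaround for merge itself, sketched under the assumption that every NULL column may be replaced by a zero-length integer vector: rebuild the frame with typed empty columns before merging, so that it has the same shape as df2.a. The helper name repair_null_columns and its fill argument are hypothetical, not part of base R.

```r
df1 <- data.frame(V1 = 1:4, V2 = 5:8)
df2.b <- structure(list(V1 = NULL, V2 = NULL, V3 = NULL, V4 = NULL),
                   row.names = c(NA, 0L), class = "data.frame")

# Hypothetical helper: swap every NULL column for a zero-length vector.
# Assumption: integer(0) is an acceptable column type for the caller.
repair_null_columns <- function(df, fill = integer(0)) {
  cols <- lapply(unclass(df), function(col) if (is.null(col)) fill else col)
  as.data.frame(cols, stringsAsFactors = FALSE)
}

df2.fixed <- repair_null_columns(df2.b)
merged <- merge(df1, df2.fixed, all = TRUE)
names(merged)
```

This only covers frames in which the non-NULL columns are empty as well; if some columns hold data, as.data.frame would most likely complain about differing numbers of rows, which is arguably the right outcome.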

answered Oct 28 '22 18:10

Allan Cameron

I tracked the cause of this issue and found that this mistake arises in the following section of merge.data.frame:

y <- y[c(m$yi, if (all.x) rep.int(1L, nxx), if (all.y) m$y.alone), 
            -by.y, drop = FALSE]

To show the problem, try the following code:

df2.b[rep(1, 4), -(1:2), drop = FALSE]
#        V4
# NA   NULL
# NA.1 <NA>
# NA.2 <NA>
# NA.3 <NA>
# Warning message:
# In format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x,  :
#   corrupt data frame: columns will be truncated or padded with NAs

df2.a[rep(1, 4), -(1:2), drop = FALSE]
#      V3 V4
# NA   NA NA
# NA.1 NA NA
# NA.2 NA NA
# NA.3 NA NA

Therefore, this issue is caused by [.data.frame. A section of the source code of [.data.frame is:

for (j in seq_along(x)) {
        xj <- xx[[sxx[j]]]
        x[[j]] <- if (length(dim(xj)) != 2L){
            xj[i]
        }else{ xj[i, , drop = FALSE]}
    }

Here, x is the resulting data.frame to be returned; at this point it has only the columns V3 and V4. xx is a copy of the input data.frame (df2.b in our case). The for-loop first assigns NULL to column 1 of x, which deletes V3. Next, it assigns NULL to column 2 of x. However, as V3 is gone, there is no longer a column 2, so x is not affected. That's why we get the unexpected result.
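The effect of that loop can be imitated on plain lists. The sketch below is a simplified stand-in, not the real `[.data.frame` source: the function name subset_columns is hypothetical, and it handles only one-dimensional columns.

```r
# Hypothetical stand-in for the column loop in `[.data.frame`:
# `cols` plays the role of the selected columns, `i` the row index.
subset_columns <- function(cols, i) {
  x <- cols
  for (j in seq_along(cols)) {
    xj <- cols[[j]]
    # When xj is NULL, xj[i] is NULL too, and assigning NULL with [[<-
    # deletes an element of x instead of storing a value.
    x[[j]] <- xj[i]
  }
  x
}

null_cols <- list(V3 = NULL, V4 = NULL)
int_cols  <- list(V3 = integer(0), V4 = integer(0))

names(subset_columns(null_cols, rep(1L, 4)))  # one name goes missing
names(subset_columns(int_cols, rep(1L, 4)))   # both names survive
```

With zero-length integer columns, each xj[i] is a vector of four NA values, so nothing is deleted and both columns are kept.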

If we convert df1 and df2.b to data.table, merging them throws an error. It seems that data.table's merge method treats such cases more strictly, and the error message helps us avoid getting unexpected results.

answered Oct 28 '22 16:10

mt1022

I'll try to provide an answer as complete as I can...

(When I posted the answer, I noticed I had joined the party too late :D I'll leave the answer anyway as, I hope, it'll provide another interesting point of view)

DEBUG MERGE

Let's start by looking at the merge function. Specifically, the method that gets called here is merge.data.frame (a function exported by the base package).

If you debug merge.data.frame(df1,df2.b,all = TRUE), you'll see at line 124 that this gets called:

y <- y[c(m$yi, if (all.x) rep.int(1L, nxx), if (all.y) m$y.alone), 
   -by.y, drop = FALSE]

y is identical to df2.b.

Since m$yi is equal to integer(0), all.x is TRUE, and all.y is FALSE, this can be simplified to:

y[rep.int(1L, nxx), -by.y, drop = FALSE]

The output of it is:

       V4
NA   NULL
NA.1 <NA>
NA.2 <NA>
NA.3 <NA>
Warning message:
In format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x,  :
   corrupt data frame: columns will be truncated or padded with NAs

So this is the behind-the-scenes "problem" that merge tells us nothing about.

Let's dig into it.

First of all, the actual result is not what gets printed; the default print.data.frame method tricks our eyes.

The output of

unclass(y[rep.int(1L, nxx), -by.y, drop = FALSE])

is

$V4
NULL

attr(,"row.names")
[1] "NA"   "NA.1" "NA.2" "NA.3"

NULL doesn't get duplicated, which makes sense since you can't build a vector out of two NULLs:

identical(c(NULL, NULL), NULL)
#> TRUE

As the warning says, the data.frame is corrupted and the printing may be faulty (which it is!).

That's because the data.frame was created in a tricky way with structure() instead of data.frame() or as.data.frame(), neither of which would have led you to that structure.
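Since printing is unreliable for such an object, a print-independent check can help: compare the length of each underlying column with the row count. This is a minimal sketch; the variable names bad, col_lengths and n_rows are only illustrative, and the literal 4L stands in for nxx.

```r
df2.b <- structure(list(V1 = NULL, V2 = NULL, V3 = NULL, V4 = NULL),
                   row.names = c(NA, 0L), class = "data.frame")

# Same kind of subsetting that merge.data.frame performs internally.
bad <- df2.b[rep.int(1L, 4L), -(1:2), drop = FALSE]

col_lengths <- lengths(unclass(bad))  # length of each underlying column
n_rows <- nrow(bad)                   # row count implied by row.names

# Any mismatch means the object is not a well-formed data frame.
mismatch <- col_lengths != n_rows
names(col_lengths)[mismatch]
```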

So this is the story of how you get to one column only.

The question is why?

For that we need to go look at the function [.data.frame.

DEBUG [.data.frame

Let's observe some behaviors first.

> df2.b[1,]
     V2   V4
NA NULL NULL
> df2.b[,1]
NULL
> df2.b[,1, drop = FALSE]
[1] V1
<0 rows> (or 0-length row.names)
> df2.b[1,1]
NULL
> df2.b[1,1, drop = FALSE]
data frame with 0 columns and 1 row
> df2.b[1,1:2]
     V2
NA NULL
> df2.b[c(1,1),1:2]
       V2
NA   NULL
NA.1 <NA>
Warning message:
In format.data.frame(if (omit) x[seq_len(n0), , drop = FALSE] else x,  :
   corrupt data frame: columns will be truncated or padded with NAs

The last three look pretty unexpected. The last one in particular is our case, the same one we saw before.

If you try to debug:

debugonce(base:::`[.data.frame`)
df2.b[c(1,1),1:2]

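you'll find at line 109 this code:

 for (j in seq_along(x)) {
  xj <- xx[[sxx[j]]]
  x[[j]] <- if (length(dim(xj)) != 2L) 
   xj[i]
  else xj[i, , drop = FALSE]
 }

At that point, the variables are as follows:

x = list(V1 = NULL, V2 = NULL)
xx = df2.b
sxx = 1:2
i = 1:2

If you run the for loop with those variables, you will get that x is:

> x
$V2
NULL

Looks like we found the source of the disappearing column.

Now, where exactly is the problem?

When j == 1, x[[j]] <- ... is equal to x$V1 <- NULL, which in R deletes the element V1 from a list. Therefore x becomes a list with only one element:

> x
$V2
NULL

When j == 2, x[[j]] doesn't exist anymore, because the first iteration deleted the first item and only one is left. R therefore tries to assign a new second item, but since you can't assign NULL as an item this way (x[[2]] <- NULL), x does not change. Therefore you end up with only one column.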

SUM UP

merge behaves weirdly because you created your data frame in an improper manner.

merge doesn't tell you that the data frame is actually corrupted, and it carries on even though it shouldn't.

Ultimately, it's [ and the way it handles subsetting that causes the final loss of one of the columns.

DPLYR

Honestly, just use dplyr::full_join(df1, df2.b). It takes nothing for granted, and it actually results in the error you would have expected from the beginning:

> dplyr::full_join(df1, df2.b)
Joining, by = c("V1", "V2")
Error: All columns in a tibble must be vectors.
x Column `V1` is NULL.
x Column `V2` is NULL.
x Column `V3` is NULL.
x Column `V4` is NULL.
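If adding a dependency is not an option, similar strictness can be approximated in base R. The sketch below assumes that any NULL column should abort the merge; the names null_columns and strict_merge are hypothetical, and the error text is made up for illustration.

```r
df1 <- data.frame(V1 = 1:4, V2 = 5:8)
df2.b <- structure(list(V1 = NULL, V2 = NULL, V3 = NULL, V4 = NULL),
                   row.names = c(NA, 0L), class = "data.frame")

# Hypothetical helper: names of the columns that are stored as NULL.
null_columns <- function(df) {
  cols <- unclass(df)
  names(cols)[vapply(cols, is.null, logical(1))]
}

# Hypothetical wrapper: refuse to merge when either input has NULL columns.
strict_merge <- function(x, y, ...) {
  bad <- unique(c(null_columns(x), null_columns(y)))
  if (length(bad) > 0L) {
    stop("NULL column(s): ", paste(bad, collapse = ", "))
  }
  merge(x, y, ...)
}

# Capture the error message instead of stopping the script.
msg <- tryCatch(strict_merge(df1, df2.b, all = TRUE),
                error = function(e) conditionMessage(e))
msg
```

Inputs without NULL columns are passed straight through to merge, so the wrapper does not change results for well-formed data frames.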

answered Oct 28 '22 17:10

Edo
