Automatically coerce all column types of one data frame to the type of another prior to binding

Tags:

r

dplyr

Let's say I have two data frames I want to bind:

ds_a <- data.frame(
  x = 1:6,
  y = 5,
  z = "4",
  l = 2,
  stringsAsFactors = FALSE
)

ds_b <- data.frame(
  x = as.factor(1:6),
  y = "5",
  p = 2,
  stringsAsFactors = FALSE
)

When I try to bind them I get the following error:

> bind_rows(ds_a, ds_b)
Error: Can't combine `..1$x` <integer> and `..2$x` <factor<4c79c>>.

My usual workaround is to convert every column in both data frames to character, bind the two data frames, and then manually convert each column back to its original type.

Is there a way to simply coerce all the type collisions between ds_a and ds_b by automatically casting ds_b's columns to match ds_a (assuming they're named the same)?

More generally, I'd like a solution to automatically convert all the columns in ds_b to the type of ds_a wherever the column names match. And the solution should work if ds_b and ds_a don't share all the same columns (just filling with NA when columns don't exist in one, but do in another).

Here's the intended outcome:

ds_merged <- read.table(text = 'x y z l p
1 1 5 4 2 NA
2 2 5 4 2 NA
3 3 5 4 2 NA
4 4 5 4 2 NA
5 5 5 4 2 NA
6 6 5 4 2 NA
7 1 5 NA NA 2
8 2 5 NA NA 2
9 3 5 NA NA 2
10 4 5 NA NA 2
11 5 5 NA NA 2
12 6 5 NA NA 2', header = TRUE, row.names = NULL)

> ds_merged

   row.names x y  z  l  p
1          1 1 5  4  2 NA
2          2 2 5  4  2 NA
3          3 3 5  4  2 NA
4          4 4 5  4  2 NA
5          5 5 5  4  2 NA
6          6 6 5  4  2 NA
7          7 1 5 NA NA  2
8          8 2 5 NA NA  2
9          9 3 5 NA NA  2
10        10 4 5 NA NA  2
11        11 5 5 NA NA  2
12        12 6 5 NA NA  2
Parseltongue asked Oct 25 '21 19:10


2 Answers

We could use type.convert()

Explanation (added after a comment from the OP):

type.convert() does not consider ds_a at all; it only inspects the values in ds_b. You can check this by comparing glimpse(ds_a) with glimpse() of the resulting data frame:

Note that the columns of ds_a have the same classes as the corresponding columns in result.

> # compare classes
> glimpse(ds_a)
Rows: 6
Columns: 4
$ x <int> 1, 2, 3, 4, 5, 6
$ y <dbl> 5, 5, 5, 5, 5, 5
$ z <chr> "4", "4", "4", "4", "4", "4"
$ l <dbl> 2, 2, 2, 2, 2, 2
> glimpse(ds_b)
Rows: 6
Columns: 3
$ x <fct> 1, 2, 3, 4, 5, 6
$ y <chr> "5", "5", "5", "5", "5", "5"
$ p <dbl> 2, 2, 2, 2, 2, 2
> glimpse(result)
Rows: 12
Columns: 5
$ x <int> 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6
$ y <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5
$ z <chr> "4", "4", "4", "4", "4", "4", NA, NA, NA, NA, NA, NA
$ l <dbl> 2, 2, 2, 2, 2, 2, NA, NA, NA, NA, NA, NA
$ p <int> NA, NA, NA, NA, NA, NA, 2, 2, 2, 2, 2, 2

What type.convert() does:

  1. It applies the best-fitting class to each column of ds_b (note that the %>% is inside the bind_rows() call). All values of ds_b$x are integers, so R converts ds_b$x from factor to integer.
  2. Every value in ds_b$y is a character string that looks like an integer, so R converts the column from character to integer. This can be misleading: ds_a$y is double while ds_b$y is now integer. bind_rows() handles that combination without complaint, and the combined column is double.
> # showing what type.convert does to ds_b
> ds_b$x <- as.integer(ds_b$x)
> ds_b$y <- as.integer(ds_b$y)
> ds_b %>% 
+   as_tibble()
# A tibble: 6 x 3
      x     y     p
  <int> <int> <dbl>
1     1     5     2
2     2     5     2
3     3     5     2
4     4     5     2
5     5     5     2
6     6     5     2
> bind_rows(ds_a, ds_b) %>% 
+   as_tibble()
# A tibble: 12 x 5
       x     y z         l     p
   <int> <dbl> <chr> <dbl> <dbl>
 1     1     5 4         2    NA
 2     2     5 4         2    NA
 3     3     5 4         2    NA
 4     4     5 4         2    NA
 5     5     5 4         2    NA
 6     6     5 4         2    NA
 7     1     5 NA       NA     2
 8     2     5 NA       NA     2
 9     3     5 NA       NA     2
10     4     5 NA       NA     2
11     5     5 NA       NA     2
12     6     5 NA       NA     2
  3. It converts ds_b$p from double to integer, because its values are whole numbers.
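
To see the steps above in isolation, here is a minimal base R sketch that compares the column classes of ds_b before and after type.convert(). The names before, converted and after are only illustrative. ds_a never appears, which is the point: the result depends on the values in ds_b alone.

```r
# ds_b as defined in the question
ds_b <- data.frame(
  x = as.factor(1:6),
  y = "5",
  p = 2,
  stringsAsFactors = FALSE
)

# Classes before conversion: x is a factor, y is character, p is double
before <- sapply(ds_b, class)

# type.convert() re-reads the values of each column and picks a fitting type;
# as.is = TRUE keeps non-numeric strings as character instead of factor
converted <- type.convert(ds_b, as.is = TRUE)
after <- sapply(converted, class)

# Show both sets of classes side by side, one column per variable
print(rbind(before, after))
```

Because the types are inferred from the data, a column of ds_b that holds non-numeric text would stay character even if the matching column in ds_a were numeric, so this approach only lines up with ds_a when the values happen to allow it.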

Solution:

library(dplyr)
bind_rows(ds_a, ds_b %>% type.convert(as.is=TRUE))

output:

   x y    z  l  p
1  1 5    4  2 NA
2  2 5    4  2 NA
3  3 5    4  2 NA
4  4 5    4  2 NA
5  5 5    4  2 NA
6  6 5    4  2 NA
7  1 5 <NA> NA  2
8  2 5 <NA> NA  2
9  3 5 <NA> NA  2
10 4 5 <NA> NA  2
11 5 5 <NA> NA  2
12 6 5 <NA> NA  2
TarJae answered Oct 19 '22 23:10


When both data frames have the same columns, you can change the classes of one data frame to match the other and then row-bind them.

library(dplyr)
library(purrr)

bind_rows(ds_a, map2_df(ds_b, map(ds_a, class), ~{class(.x) <- .y;.x}))

#   x y
#1  1 5
#2  2 5
#3  3 5
#4  4 5
#5  5 5
#6  6 5
#7  1 5
#8  2 5
#9  3 5
#10 4 5
#11 5 5
#12 6 5

map2_df is used to change the class of each column of ds_b, where

.x - is a column of ds_b.

.y - is the class of the corresponding column in ds_a, obtained with map(ds_a, class).

The function sets the class of .x to .y and map2_df combines the converted columns into a data frame. We then use bind_rows to bind that data frame to ds_a.


If the data frames do not share all of their columns, you can change the classes of only the common ones and then bind the rows.

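new_bind <- function(a, b) {
  # Columns present in both data frames
  common_cols <- intersect(names(a), names(b))
  # Give each common column of b the class of the same column in a
  b[common_cols] <- map2_df(b[common_cols], 
               map(a[common_cols], class), ~{class(.x) <- .y;.x})
  # Columns missing from one side are filled with NA
  bind_rows(a, b)  
}
new_bind(ds_a, ds_b) 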

#   x y    z  l  p
#1  1 5    4  2 NA
#2  2 5    4  2 NA
#3  3 5    4  2 NA
#4  4 5    4  2 NA
#5  5 5    4  2 NA
#6  6 5    4  2 NA
#7  1 5 <NA> NA  2
#8  2 5 <NA> NA  2
#9  3 5 <NA> NA  2
#10 4 5 <NA> NA  2
#11 5 5 <NA> NA  2
#12 6 5 <NA> NA  2            
Ronak Shah answered Oct 19 '22 22:10