Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

bind_rows of different data types

Tags:

r

dplyr

redcap

I'd like to stack a list of data.frames, but sometimes the columns have different data types. I'd like the operation to coerce to the lowest common denominator (which is usually character in my case).

This stacking occurs inside a package function that accepts almost any list of data.frames. It doesn't realistically have the ability to coerce ds_a$x to a character before bind_rows().

ds_a <- data.frame(
  x = 1:6,
  stringsAsFactors = FALSE
)
ds_b <- data.frame(
  x = c("z1", "z2"),
  stringsAsFactors = FALSE
)

# These four implementations throw:
# Error: Can not automatically convert from integer to character in column "x".
ds_1 <- dplyr::bind_rows(ds_a, ds_b)
ds_2 <- dplyr::bind_rows(ds_b, ds_a)
ds_3 <- dplyr::bind_rows(list(ds_a, ds_b))
ds_4 <- dplyr::union_all(ds_a, ds_b)

I'd like the output to be a data.frame with a single character vector:

   x
1  1
2  2
3  3
4  4
5  5
6  6
7 z1
8 z2

I have some long-term plans to use meta-data from the (REDCap) database to influence the coercion, but I'm hoping there's a short-term general solution for the stacking operation.

like image 820
wibeasley Avatar asked Sep 07 '16 19:09

wibeasley


2 Answers

We can use rbindlist from data.table

library(data.table)
rbindlist(list(ds_a, ds_b))
#    x
#1:  1
#2:  2
#3:  3
#4:  4
#5:  5
#6:  6
#7: z1
#8: z2
like image 160
akrun Avatar answered Oct 04 '22 07:10

akrun


Recently I switched to an approach that keeps all columns as strings initially(when converting from plain-text to a data.frame), then stacks, and finally converts the columns to an appropriate data type after it has all the rows to make a decision (using readr::type_convert()).

It mimics this example. I haven't done any performance comparisons but there wasn't a noticeable difference (the internet is the real bottleneck). Also, I kinda like the idea of reducing the number of data type conversions.

library(magrittr)
col_types <- readr::cols(.default = readr::col_character())
raw_a <- "x,y\n1,21\n2,22\n3,23\n4,24\n5,25\n6,26"
raw_b <- "x,y\nz1,31\nz2,32"
ds_a <- readr::read_csv(raw_a, col_types=col_types)
ds_b <- readr::read_csv(raw_b, col_types=col_types)

list(ds_a, ds_b) %>% 
  dplyr::bind_rows() %>% 
  readr::type_convert()
#> Parsed with column specification:
#> cols(
#>   x = col_character(),
#>   y = col_double()
#> )
#> # A tibble: 8 x 2
#>   x         y
#>   <chr> <dbl>
#> 1 1        21
#> 2 2        22
#> 3 3        23
#> 4 4        24
#> 5 5        25
#> 6 6        26
#> 7 z1       31
#> 8 z2       32

Created on 2019-12-03 by the reprex package (v0.3.0)

like image 36
wibeasley Avatar answered Oct 04 '22 09:10

wibeasley