Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R error "Can't join on ... because of incompatible types"

Tags:

r

dplyr

I write code that runs in high throughput production environments expected to handle a variety of input from multiple users. In this environment, I often need to join (with dplyr) two multi-column dataframes with join columns of mismatched types which produces this error. "Can't join on ... because of incompatible types"

The production code is expected to handle input csvs that are 150+ columns by 0-1000 rows, with 12-20 join columns. I use read.table for speed, and because the content of the 150 columns can vary, allow type to be inferred by data content.

It would be great to have a general-purpose function that matches join column types programmatically, making no assumptions about column name or location.

like image 301
GGAnderson Avatar asked Mar 10 '18 23:03

GGAnderson


Video Answer


2 Answers

This is a frequently viewed question, so many others must run into the error, so deserves a more complete answer.

The simple solution for correcting this join error is to simply mutate the class of the column(s) causing the problem. This can be done as follows:

  1. glimpse the column classes in the dataframes to be joined
  2. mutate the column class to match using as.numeric, as.logical or as.character. For example:

    df2 <- df2 %>%  
        mutate(column1 = as.numeric(column1))
    

A solution for production environments is in the matchColClasses function shown, which does the following:

  1. Identify columns that share the same name (sharedColNames)
  2. Use the master data frame (df1) to identify the shared columns classes
  3. Reassign column classes in df2 to match df1

    matchColClasses <- function(df1, df2) {
    
      sharedColNames <- names(df1)[names(df1) %in% names(df2)]
      sharedColTypes <- sapply(df1[,sharedColNames], class)
    
      for (n in sharedColNames) {
         class(df2[, n]) <- sharedColTypes[n]
      }
    
      return(df2)
     }
    

This function works well in our production environment, with heterogenous data types; character, numeric and logical.

like image 58
GGAnderson Avatar answered Sep 19 '22 14:09

GGAnderson


If the problem is that the classes of columns are not correctly detected after reading the data - you can specify the classes for each column at the data reading step.

So with read.csv() function you would do it like this:

df1 <- read.csv("data.csv", colClasses=c("numeric", "Date", "character"))
df2 <- read.csv("data.csv", colClasses=c("numeric", "Date", "character"))

After this df1 and df2 will have columns of the same types.

The classes here are just an example, you will have to specify the ones that apply to your data.

Also take a look at help(read.table)

like image 40
Karolis Koncevičius Avatar answered Sep 18 '22 14:09

Karolis Koncevičius