
Any way to automatically correct all variable classes in a dataframe

I have a dataframe with about 250 variables. Unfortunately, all of these variables were imported as character from a SQL database using sqldf. The problem: not all of them should be character. Some are numeric, some are integers, and some are dates. I'd like to build a model that runs over all the variables, and to do this I need to make sure each variable has the right class. Fixing them one by one is probably the most accurate approach, but it is very manual.

How could I automatically correct all classes? Perhaps there is a way to detect whether a column contains alphabetic characters or only digits?

I don't think an automatic approach can correct every class perfectly. But if it corrects most of them, I can take care of the remaining ones manually.

I am adding a sqldf tag in case anybody knows of any way to correct this when importing the data, but I assume it's not sqldf's fault but rather the database's.

jgozal asked Jan 04 '16 20:01


1 Answer

The closest thing to "automatic" type conversion on a data frame would probably be

df[] <- lapply(df, type.convert)

where df is your data set. The function type.convert()

Converts a character vector to logical, integer, numeric, complex or factor as appropriate.

Have a read of help(type.convert), it might be just what you want.

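In my experience, type.convert() is very reliable. It is also what read.table() uses internally to decide column classes, so its behavior matches what you would get from reading the same values from a text file. Use as.is = TRUE if you don't want character columns coerced to factors. Recent versions of R (4.x) warn when as.is is not specified and then keep character columns as character, so it is best to pass it explicitly. Note that type.convert() does not recognize dates; those columns remain character (or become factors) and need a separate conversion.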
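As a minimal sketch of the as.is = TRUE variant, here is the same lapply() idiom applied to a small made-up data frame. The data frame `raw` and its column names are hypothetical and only stand in for a table where everything arrived as character.

```r
# Hypothetical data: every column imported as character
raw <- data.frame(
  id    = c("1", "2", "3"),
  price = c("9.99", "12.50", "7"),
  flag  = c("TRUE", "FALSE", NA),
  name  = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Convert each column; as.is = TRUE keeps non-numeric text as character
raw[] <- lapply(raw, type.convert, as.is = TRUE)

# Inspect the resulting classes
classes <- sapply(raw, class)
print(classes)
# id is integer, price is numeric, flag is logical, name stays character
```

Passing as.is explicitly also avoids the warning that newer R versions emit when it is left out.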

Here's a quick example of it working on iris. First we'll change all columns to character, then run type.convert() on it.

## Original column classes in iris
sapply(iris, class)
# Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
#    "numeric"    "numeric"    "numeric"    "numeric"     "factor" 

## Change all columns to character
iris[] <- lapply(iris, as.character)
sapply(iris, class)
# Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
#  "character"  "character"  "character"  "character"  "character" 

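## Run type.convert(); as.is = FALSE turns character columns into factors
## (pass it explicitly, since newer R versions no longer default to this)
iris[] <- lapply(iris, type.convert, as.is = FALSE)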
sapply(iris, class)
# Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
#    "numeric"    "numeric"    "numeric"    "numeric"     "factor" 

We can see that the columns were returned to their original classes. This is because type.convert() coerces columns to the "most appropriate" type.
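Since the question also mentions dates, which type.convert() does not detect, one possible follow-up pass is sketched below. The helper `to_date_if_possible()`, the data frame `events`, and the assumed "%Y-%m-%d" format are all hypothetical; adjust the format to whatever the database actually returns.

```r
# Hypothetical helper: turn a character vector into Date only when every
# non-missing value parses under the assumed format; otherwise return it
# unchanged. Note that as.Date() ignores trailing characters after a match,
# so this check is lenient rather than strict.
to_date_if_possible <- function(x, format = "%Y-%m-%d") {
  if (!is.character(x)) return(x)
  parsed <- as.Date(x, format = format)
  same_missing <- all(is.na(parsed) == is.na(x))
  if (same_missing && any(!is.na(parsed))) parsed else x
}

# Hypothetical data: a date-like column and a plain text column
events <- data.frame(
  when = c("2016-01-04", "2016-02-15", NA),
  note = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

# First the generic conversion, then the date pass on what is still character
events[] <- lapply(events, type.convert, as.is = TRUE)
events[] <- lapply(events, to_date_if_possible)

print(sapply(events, class))
# when is now Date; note stays character
```

Running the date pass after type.convert(as.is = TRUE) means it only ever sees columns that are still character, so numeric columns are never touched.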

Rich Scriven answered Oct 31 '22 10:10