
Criteria for deciding which character columns should be converted to factors

Tags: r

I have been working through the book "Analyzing Baseball Data with R" by Marchi and Albert and am wondering about an issue which they don't address.

Many of the datasets I need to import are fairly large (though not really "Big" in the sense of "Big Data"). For example, the Retrosheet Game Logs have 1 csv file per year dating back to 1871, where each file has a row for each game played that year and 161 columns. When I read one into a dataframe using read.csv() with the default setting for stringsAsFactors, fully 75 of the 161 columns become factors. Some of these columns conceptually are factors (such as one containing "D" or "N" for day or night games), but others are probably better left as strings (many of the columns contain the names of starting pitchers, closers, etc.). I know how to convert columns from factors to strings or vice versa, but I don't want to have to scan through 161 columns, making an explicit decision for 75 of them.
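
For reference, the import step looks roughly like this (the file name and header = FALSE follow the Retrosheet game-log format, and stringsAsFactors is set explicitly here only because its default changed to FALSE in R 4.0.0):

# Read one season's Retrosheet game log; the files have no header row.
GL2016 <- read.csv("GL2016.TXT", header = FALSE, stringsAsFactors = TRUE)

# Count how many of the 161 columns were imported as factors.
sum(sapply(GL2016, is.factor))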

The reason I think this is important is that I've noticed that conceptually small dataframes obtained by subsetting these game logs are surprisingly large, given the need to retain the full factor information. For example, given the dataframe GL2016 obtained by downloading, unzipping, and then reading in the file, object.size(GL2016) is about 2.8 MB, and when I use:

df <- with(GL2016,GL2016[V7 == "CLE" & V13 == "D",])

to extract the home day games played by the Cleveland Indians in 2016, I get a df with 26 rows. 26/2428 (where 2428 is the number of rows in the whole dataframe) is slightly more than 1%, but object.size(df) is around 1.3 MB, which is far more than 1% of the size of GL2016.
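A rough way to see where the space goes (V7 is the home-team column used in the subset above) is to note that the factor columns in df still carry every level from the full log, not just the values that occur in the 26 rows:

object.size(GL2016)   # roughly 2.8 MB for the full 2428-row log
object.size(df)       # roughly 1.3 MB for just 26 rows
nlevels(GL2016$V7)    # number of home-team codes in the full log
nlevels(df$V7)        # same number: the subset keeps every unused level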

I came up with an ad-hoc solution. I first defined a function:

big.factor <- function(v,k){is.factor(v) && length(levels(v)) > k}

And then used mutate_if from dplyr like this:

GL2016 %>% mutate_if(function(v){big.factor(v,30)},as.character) -> GL2016

30 is the number of teams in MLB, and I somewhat arbitrarily decided that any factor with more than 30 levels should probably be treated as a string.

After this code has been run, the number of factor variables has been reduced from 75 to 12. It works in the sense that, even though GL2016 is now around 3.2 MB (slightly larger than before), if I subset the dataframe to pull out the Cleveland day games, the resulting dataframe is just 0.1 MB.
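
For completeness, the same conversion can apparently also be written with the newer across()/where() idiom (mutate_if is superseded in current dplyr); this is just a sketch using the same ad-hoc cutoff of 30 levels:

library(dplyr)

GL2016 <- GL2016 %>%
  mutate(across(where(function(v) is.factor(v) && nlevels(v) > 30),
                as.character))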

Questions:

1) What criteria (hopefully less ad-hoc than what I used above) are relevant for deciding which character columns should be converted to factors when importing a large data set?

2) I am aware of the cost in terms of memory footprint of converting all character data to factors, but am I incurring any hidden costs (say in processing time) when I convert most of these factors back into strings?

asked Nov 09 '22 by John Coleman

1 Answer

Essentially, I think what you need to do is:

df <- with(GL2016,GL2016[V7 == "CLE" & V13 == "D",])
df <- droplevels(df)

The droplevels function will remove all the unused factor levels, and thus reduce the size of df immensely.
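
A rough before-and-after check (the sizes are the approximate figures from the question):

df <- with(GL2016, GL2016[V7 == "CLE" & V13 == "D", ])
object.size(df)              # about 1.3 MB: unused factor levels are still stored
object.size(droplevels(df))  # far smaller once the unused levels are dropped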

answered Nov 15 '22 by Feng