
data.table::fread's stringsAsFactors=TRUE argument doesn't convert character columns to factor type - what's the workaround?

I know this issue has been raised in several places, and I have spent hours looking for a good solution without success, which is why I'm asking here.

I have a huge data file (~5 GB), which I read with fread():

library(data.table)
df <- fread('output.txt', sep = "|", stringsAsFactors = TRUE)
head(df, 5)
       age            income homeowner_status_desc marital_status_cd gender
1:         $35,000 - $49,999                                               
2: 35 - 44 $35,000 - $49,999                  Rent            Single      F
3:         $35,000 - $49,999                                               
4:                                                                         
5:         $50,000 - $74,999 
str(df)
Classes ‘data.table’ and 'data.frame':  999 obs. of  5 variables:
 $ age                  : chr  "" "35 - 44" "" "" ...
 $ income               : chr  "$35,000 - $49,999" "$35,000 - $49,999" "$35,000 - $49,999" "" ...
 $ homeowner_status_desc: chr  "" "Rent" "" "" ...
 $ marital_status_cd    : chr  "" "Single" "" "" ...
 $ gender               : chr  "" "F" "" "" ...
 - attr(*, ".internal.selfref")=<externalptr> 

There is missing data (the blank cells). The original data has many columns, so I need a way to convert every column that contains strings to a factor. What is the best practice for this? I was considering converting to a data frame first, but is it possible to do this while it is still a data.table?

hmi2015 asked Jul 10 '15 21:07



1 Answer

The stringsAsFactors argument has just been implemented for fread in v1.9.6+.

From NEWS:

  1. Implemented stringsAsFactors argument for fread(). When TRUE, character columns are converted to factors. Default is FALSE. Thanks to Artem Klevtsov for filing #501, and to @hmi2015 for this SO post.
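
For versions where the argument has no effect, one possible workaround is to convert the character columns by reference after reading, which keeps the object a data.table. The following is a minimal sketch, not the approach from the NEWS entry: the small table `dt` and its column `n` are made-up stand-ins for the result of fread('output.txt', sep = "|").

```r
library(data.table)

# Hypothetical stand-in for the table returned by fread() in the question
dt <- data.table(
  age    = c("", "35 - 44", ""),
  income = c("$35,000 - $49,999", "$35,000 - $49,999", "$50,000 - $74,999"),
  n      = c(1L, 2L, 3L)
)

# Names of the columns that are currently of type character
chr_cols <- names(dt)[vapply(dt, is.character, logical(1))]

# Convert those columns to factor by reference; other columns are untouched
if (length(chr_cols) > 0L) {
  dt[, (chr_cols) := lapply(.SD, as.factor), .SDcols = chr_cols]
}

str(dt)
```

Because `:=` updates in place, the table is not copied, which matters for a file of several GB. Note that blank strings are kept as a factor level "" rather than being turned into NA.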
Arun answered Oct 29 '22 03:10