
data.table() still converts strings to factors?

Tags: r, data.table

From what I can see here I would assume that data.table v1.8.0+ does not automatically convert strings to factors.

Specifically, to quote Matthew Dowle from that page:

No need for stringsAsFactors. Done like this in v1.8.0:

o character columns are now allowed in keys and are preferred to factor. data.table() and setkey() no longer coerce character to factor. Factors are still supported.

I'm not seeing that ... here's my R session transcript:

First, I make sure my installed data.table is recent enough (later than 1.8.0):

> library(data.table)
data.table 1.8.8  For help type: help("data.table")

Next, I create a 2x2 data.table. Notice that it creates factors ...

> m <- matrix(letters[1:4], ncol=2)
> str(data.table(m))
Classes ‘data.table’ and 'data.frame':  2 obs. of  2 variables:
 $ V1: Factor w/ 2 levels "a","b": 1 2
 $ V2: Factor w/ 2 levels "c","d": 1 2
 - attr(*, ".internal.selfref")=<externalptr> 

When I use stringsAsFactors in data.frame() and then call data.table(), all is well ...

> str(data.table(data.frame(m, stringsAsFactors=FALSE)))
Classes ‘data.table’ and 'data.frame':  2 obs. of  2 variables:
 $ X1: chr  "a" "b"
 $ X2: chr  "c" "d"
 - attr(*, ".internal.selfref")=<externalptr> 

What am I missing? Is data.frame() supposed to convert strings to factors, and if so, is there a "better way" of turning that behavior off?

Thanks!

vijay asked Jul 17 '13 04:07



2 Answers

Update:

This issue seems to have slipped past somehow until now. Thanks to @fpinter for filing the issue recently. It is now fixed in commit 1322. From NEWS, No:39 under bug fixes for v1.9.3:

as.data.table.matrix does not convert strings to factors by default. data.table likes and prefers using character vectors to factors. Closes #745. Thanks to @fpinter for reporting the issue on the github issue tracker and to vijay for reporting here on SO.
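
To check whether an installed version already includes this fix, something along these lines should work. This is only a sketch: it assumes data.table is installed and reuses the 2x2 matrix from the question, and `col_classes` is just an illustrative name.

```r
library(data.table)

# Same 2x2 character matrix as in the question
m <- matrix(letters[1:4], ncol = 2)

# Which version is actually loaded?
print(packageVersion("data.table"))

# Inspect the class of every column produced from the matrix.
# With the fix described above, these should be "character" rather than "factor".
col_classes <- sapply(data.table(m), class)
print(col_classes)
```

If the columns still come back as factor, the installed version predates the fix.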


It appears that this non-coercion is not yet implemented for matrix arguments.

data.table() deals with matrix arguments using as.data.table():

if (is.matrix(xi) || is.data.frame(xi)) {
    xi = as.data.table(xi, keep.rownames = keep.rownames)
    x[[i]] = xi
    numcols[i] = length(xi)
}

and as.data.table.matrix contains:

if (mode(x) == "character") {
    for (i in ic) value[[i]] <- as.factor(x[, i])
}

It might be worth reporting this to the bug tracker. (The coercion is still present in 1.8.9, the current R-Forge version.)
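
Until a fixed version is available, one possible workaround is to convert the factor columns back to character after the fact. The following is a minimal sketch, not part of data.table itself: the name `factor_cols` is only illustrative, and the loop does nothing on versions that already keep character columns.

```r
library(data.table)

m <- matrix(letters[1:4], ncol = 2)
dt <- as.data.table(m)

# Find the columns that were coerced to factor (may be none on newer versions)
factor_cols <- names(dt)[sapply(dt, is.factor)]

# Convert each of them back to character by reference
for (col in factor_cols) {
  set(dt, j = col, value = as.character(dt[[col]]))
}

str(dt)
```

Using set() updates the columns in place, so the table is not copied.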

mnel answered Sep 17 '22 15:09


As a workaround, and to complement @mnel's answer: if you want to turn off the default behavior of data.frame(), you can set the dedicated global option.

options(stringsAsFactors=FALSE)

str(data.table(data.frame(m)))
Classes ‘data.table’ and 'data.frame':  2 obs. of  2 variables:
 $ X1: chr  "a" "b"
 $ X2: chr  "c" "d"
 - attr(*, ".internal.selfref")=<externalptr> 
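
If you would rather not change a global option, the same effect can be had by passing stringsAsFactors explicitly inside a small wrapper. The helper below is hypothetical (`matrix_to_dt` is not a data.table function); it is a sketch that assumes the input is a matrix and goes through as.data.frame() on the way to a data.table.

```r
library(data.table)

# Hypothetical helper: convert a matrix to a data.table without factor coercion,
# leaving the session-wide options untouched.
matrix_to_dt <- function(x) {
  as.data.table(as.data.frame(x, stringsAsFactors = FALSE))
}

m <- matrix(letters[1:4], ncol = 2)
dt2 <- matrix_to_dt(m)
str(dt2)
```

This keeps the behavior local to the call, so other code that relies on the session default is unaffected.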

dickoa answered Sep 20 '22 15:09