
R: use of factor

Tags: types, r

I have some data:

transaction <- c(1, 2, 3)
date <- c("2010-01-31", "2010-02-28", "2010-03-31")
type <- c("debit", "debit", "credit")
amount <- c(-500, -1000.97, 12500.81)
oldbalance <- c(5000, 4500, 17000.81)
evolution <- data.frame(transaction, date, type, amount, oldbalance, row.names = transaction, stringsAsFactors = FALSE)
evolution$date <- as.Date(evolution$date, "%Y-%m-%d")
evolution <- transform(evolution, newbalance = oldbalance + amount)
evolution

If I enter the command:

type <- factor(type) 

where type is a nominal (categorical) variable, what difference does it make to my data?

Thanks

yCalleecharan asked Dec 28 '11 06:12



2 Answers

Factors vs character vectors when doing stats: For statistical modelling there is little practical difference in how R treats factors and character vectors, because modelling functions such as lm() convert character vectors to factors internally. In fact, it's often easier to leave categorical variables as character vectors.

If you do a regression or ANOVA with lm() with a character vector as a categorical variable you'll get normal model output, although some versions of R also print the message:

Warning message:
In model.matrix.default(mt, mf, contrasts) :
  variable 'character_x' converted to a factor

Factors vs character vectors when manipulating dataframes: When manipulating dataframes, however, character vectors and factors are treated very differently. Some information on the annoyances of R & factors can be found on the Quantum Forest blog, R pitfall #3: friggin’ factors.
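Two of the differences that tend to bite when manipulating data can be sketched as follows. The vectors amount.f and type.f are made up for illustration and are not part of the question's data:

```r
# A factor that looks numeric: as.numeric() returns the internal codes,
# not the numbers shown in the labels.
amount.f <- factor(c("10", "20", "10"))
codes  <- as.numeric(amount.f)                 # 1 2 1
values <- as.numeric(as.character(amount.f))   # 10 20 10

# Assigning a value that is not an existing level produces NA
# (R also emits an "invalid factor level" warning here).
type.f <- factor(c("debit", "credit"))
type.f[2] <- "refund"
is.na(type.f[2])                               # TRUE
```

With a character vector, neither surprise occurs: the assignment simply stores "refund".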

It's useful to use stringsAsFactors = FALSE (the default since R 4.0.0) when reading data in from a .csv or .txt using read.table or read.csv. As noted in another reply, you have to make sure that the values in your character vector are consistent, or else every typo will become its own factor level. You can use the function gsub() to fix typos.
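A minimal sketch of that workflow, assuming a small made-up CSV (passed inline through the text argument of read.csv so the example is self-contained) that contains the typo "debbit":

```r
# Hypothetical file content; in practice this would come from a file path.
csv_text <- "transaction,type\n1,debit\n2,debbit\n3,credit"

df <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(df$type)                                  # "character"

# Fix the typo while the column is still a character vector ...
df$type <- gsub("^debbit$", "debit", df$type)

# ... and only then convert to a factor, so no stray level is created.
df$type <- factor(df$type)
levels(df$type)                                 # "credit" "debit"
```

Cleaning first and converting last avoids having to drop or merge unwanted levels afterwards.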

Here is a worked example showing how lm() gives you the same results with a character vector and a factor.

A random independent variable:

continuous_x <- rnorm(10,10,3)

A random categorical variable as a character vector:

character_x <- rep(c("dog", "cat"), 5)

Convert the character vector to a factor variable:

factor_x <- as.factor(character_x)

Give the two categories random values:

character_x_value <- ifelse(character_x == "dog", 5*rnorm(1,0,1), rnorm(1,0,2))

Create a random relationship between the independent variables and a dependent variable:

continuous_y <- continuous_x*10*rnorm(1,0) + character_x_value

Compare the output of a linear model with the factor variable and the character vector. Note the warning that may be given with the character vector.

summary(lm(continuous_y ~ continuous_x + factor_x))
summary(lm(continuous_y ~ continuous_x + character_x))
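
Rather than comparing the two summaries by eye, the fitted coefficients can be compared directly. This is a sketch with a fixed seed and a simplified, made-up relationship for continuous_y (not the exact one above), so that it is reproducible:

```r
set.seed(1)
continuous_x <- rnorm(10, 10, 3)
character_x  <- rep(c("dog", "cat"), 5)
factor_x     <- as.factor(character_x)

# Assumed relationship, chosen only for illustration.
continuous_y <- 2 * continuous_x + ifelse(character_x == "dog", 5, 0) + rnorm(10)

fit_factor    <- lm(continuous_y ~ continuous_x + factor_x)
fit_character <- lm(continuous_y ~ continuous_x + character_x)

# Coefficient names differ (factor_xdog vs character_xdog), so compare values only.
same_fit <- isTRUE(all.equal(unname(coef(fit_factor)), unname(coef(fit_character))))
same_fit
```

Both models use "cat" as the reference level, because the levels are sorted alphabetically in either case.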

N Brouwer answered Sep 30 '22 15:09


It all depends on what question you are asking of the data!
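
One point specific to the command in the question: type <- factor(type) converts only the free-standing vector type, not the column already copied into the data frame. A cut-down sketch (the data frame here keeps just two of the original columns):

```r
type <- c("debit", "debit", "credit")
evolution <- data.frame(transaction = 1:3, type, stringsAsFactors = FALSE)

type <- factor(type)            # changes the vector in the workspace only
class(type)                     # "factor"
class(evolution$type)           # still "character"

# To change the data frame, convert the column itself.
evolution$type <- factor(evolution$type)
class(evolution$type)           # "factor"
```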

type.c <- c("debit", "debit", "credit")
type.f <- factor(type.c)

Here type.c is a character vector, whereas type.f is a factor: neither a list nor an array, but an integer vector that carries a "levels" attribute.

storage.mode(type.c)
# [1] "character"
storage.mode(type.f)
# [1] "integer"
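
A few more base R calls make the structure of a factor visible; this is only a quick inspection sketch using the same values:

```r
type.f <- factor(c("debit", "debit", "credit"))

class(type.f)        # "factor"
typeof(type.f)       # "integer"
is.list(type.f)      # FALSE
attributes(type.f)   # the levels and the class are stored as attributes

# Stripping the class leaves the plain integer codes (plus the levels attribute).
codes <- unclass(type.f)
as.integer(codes)    # 2 2 1
```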

When a factor variable is created, R looks through all of the values it has been given and creates the "levels". Have a peek at:

 levels(type.f)
 # [1] "credit" "debit"

Then, instead of storing the character strings "debit", "credit", "mis-spelt debbit" etc., it just stores integer codes along with the levels. Have a look at:

str(type.f)
# Factor w/ 2 levels "credit","debit": 2 2 1

That is, type.c holds c("debit", "debit", "credit") and levels(type.f) gives "credit" "debit" (levels are sorted alphabetically by default), so str(type.f) lists the stored codes as 2 2 1: "debit" is level 2 and "credit" is level 1.
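
The mapping between codes and levels can be checked by rebuilding the original strings from them. The second half of this sketch shows that the codes change if the level order is given explicitly (type.ordered is a name introduced here for illustration):

```r
type.c <- c("debit", "debit", "credit")
type.f <- factor(type.c)

codes   <- as.integer(type.f)        # 2 2 1
rebuilt <- levels(type.f)[codes]     # index the levels by the codes
identical(rebuilt, type.c)           # TRUE

# With an explicit level order, "debit" becomes level 1.
type.ordered <- factor(type.c, levels = c("debit", "credit"))
as.integer(type.ordered)             # 1 1 2
```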

If a mis-typed "debbit" is in the data when the factor is created, levels(type.f) will show it as an extra level. With a character vector you could spot it with table(type.c) instead.
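
A small sketch of spotting and repairing such a typo after the factor already exists, using made-up values. Renaming a level to the name of an existing level merges the two:

```r
type.c <- c("debit", "debit", "credit", "debbit")
type.f <- factor(type.c)

levels(type.f)     # "credit" "debbit" "debit"
table(type.f)      # the typo shows up as a level with a count of 1

# Rename the stray level; R merges it into the existing "debit" level.
levels(type.f)[levels(type.f) == "debbit"] <- "debit"
levels(type.f)     # "credit" "debit"
```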

When there are only three elements, it makes little difference to storage, but as the vector gets longer the character version starts to take up more space than the factor. R keeps each distinct string only once, but a character vector still needs a pointer per element (8 bytes on a typical 64-bit build), whereas a factor needs a 4-byte integer per element plus a fixed overhead for the levels. A little experiment shows that, for a randomly sampled type.c, the threshold at which object.size(type.c) > object.size(type.f) is about 96 elements.

dc <- c("debit", "credit")
N <- 300

# let's store the calculations as a matrix
# col1 = n
# col2 = sizeof(character)
# col3 = sizeof(factors)
res <- matrix(ncol=3, nrow=N)

for (i in seq_len(N)) {
  type.c <- sample(dc, i, replace=TRUE)
  type.f <- factor(type.c)
  res[i, 1] <- i
  res[i, 2] <- object.size(type.c)
  res[i, 3] <- object.size(type.f)
  cat('N=', i, '  object.size(type.c)=',object.size(type.c), '  object.size(type.f)=',object.size(type.f), '\n')
}
plot(res[,1], res[,2], col='blue', type='l', xlab='Number of items in type.x', ylab='bytes of storage')
lines(res[,1], res[,3], col='red')
mtext('blue for character; red for factor')

cat('Threshold at:', min(which(res[,2]>res[,3])), '\n')

Apologies for lack of R'ness as I thought it would help with clarity.
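
For comparison, a more R-like sketch of the same experiment, using vapply() instead of an explicit loop and a fixed seed for reproducibility. The exact threshold depends on the platform and R version, so no particular value is claimed here:

```r
set.seed(42)
dc <- c("debit", "credit")
n_values <- seq_len(300)

# One row per vector length: size in bytes as character and as factor.
sizes <- t(vapply(n_values, function(n) {
  x <- sample(dc, n, replace = TRUE)
  c(character = as.numeric(object.size(x)),
    factor    = as.numeric(object.size(factor(x))))
}, c(character = 0, factor = 0)))

# First length at which the character vector is larger than the factor.
threshold <- min(which(sizes[, "character"] > sizes[, "factor"]))
threshold
```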

Sean answered Sep 30 '22 13:09