
R: use of factor

Tags:

types

r

I have some data:

transaction <- c(1,2,3);
date <- c("2010-01-31","2010-02-28","2010-03-31");
type <- c("debit", "debit", "credit");
amount <- c(-500, -1000.97, 12500.81);
oldbalance <- c(5000, 4500, 17000.81)
evolution <- data.frame(transaction, date, type, amount, oldbalance, row.names=transaction,  stringsAsFactors=FALSE);
evolution$date <- as.Date(evolution$date, "%Y-%m-%d");
evolution <- transform(evolution, newbalance = oldbalance + amount);
evolution

If I enter the command:

type <- factor(type)

where type is a nominal (categorical) variable, what difference does it make to my data?

Thanks

662

asked Dec 28 '11 06:12

yCalleecharan

2 Answers

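Factors vs character vectors when doing stats: In terms of doing statistics, there's little practical difference in how R's modelling functions treat factors and character vectors, because functions such as lm() convert a character vector to a factor internally. In fact, it's often easier to leave categorical variables as character vectors.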

If you do a regression or ANOVA with lm() with a character vector as a categorical variable you'll get normal model output, and in some (older) versions of R also the message:

Warning message:
In model.matrix.default(mt, mf, contrasts) :
  variable 'character_x' converted to a factor

Factors vs character vectors when manipulating dataframes: When manipulating dataframes, however, character vectors and factors are treated very differently. Some information on the annoyances of R & factors can be found on the Quantum Forest blog, R pitfall #3: friggin’ factors.
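As a rough illustration of how differently the two types behave during manipulation, here is a minimal sketch of two common surprises. The vectors f, g and ch are made-up examples, not data from the question.

```r
# Hypothetical example data, not from the original post
f <- factor(c("10", "20", "10"))

# Pitfall 1: as.numeric() on a factor returns the integer codes, not the labels
codes  <- as.numeric(f)                 # codes index into levels(f)
values <- as.numeric(as.character(f))   # go via character to recover the labels

# Pitfall 2: assigning a value that is not an existing level yields NA.
# R normally emits an "invalid factor level" warning here; it is suppressed
# only so the sketch runs quietly.
g <- factor(c("debit", "credit"))
suppressWarnings(g[1] <- "refund")
g_is_na <- is.na(g[1])

# A character vector accepts the new value without complaint
ch <- c("debit", "credit")
ch[1] <- "refund"
```

Going through as.character() first is the usual way to get the original labels back as numbers.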

It's useful to use stringsAsFactors = FALSE when reading data in from a .csv or .txt using read.table or read.csv (this has been the default since R 4.0.0). As noted in another reply, you have to make sure that everything in your character vector is consistent, or else every typo will be designated as a different factor level. You can use the function gsub() to fix typos.
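A minimal sketch of that clean-up step, assuming a hypothetical column raw_type that contains one mis-spelt value:

```r
# Hypothetical raw column containing a typo ("debbit")
raw_type <- c("debit", "debbit", "credit", "debit")

# fixed = TRUE treats the pattern as a literal string rather than a regex
clean_type <- gsub("debbit", "debit", raw_type, fixed = TRUE)

# Convert to a factor only once the values are consistent
type_f <- factor(clean_type)
levels(type_f)
```

Cleaning the strings before calling factor() avoids ending up with a spurious extra level.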

Here is a worked example showing how lm() gives you the same results with a character vector and a factor.

A random independent variable:

continuous_x <- rnorm(10,10,3)

A random categorical variable as a character vector:

character_x  <- (rep(c("dog","cat"),5))

Convert the character vector to a factor variable:

factor_x <- as.factor(character_x)

Give the two categories random values:

character_x_value <- ifelse(character_x == "dog", 5*rnorm(1,0,1), rnorm(1,0,2))

Create a random relationship between the independent variables and a dependent variable:

continuous_y <- continuous_x*10*rnorm(1,0) + character_x_value

Compare the output of a linear model with the factor variable and the character vector. Note the warning that older versions of R give with the character vector.

summary(lm(continuous_y ~ continuous_x + factor_x))
summary(lm(continuous_y ~ continuous_x + character_x))
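To check the equivalence programmatically rather than by eye, here is a self-contained sketch. The variable names, the seed and the added noise term are assumptions made for illustration, not part of the example above.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

x   <- rnorm(10, 10, 3)
grp <- rep(c("dog", "cat"), 5)
# A noise term is added here (an assumption) so the fit is not exact
y   <- 2 * x + ifelse(grp == "dog", 5, 0) + rnorm(10)

fit_chr <- lm(y ~ x + grp)      # categorical predictor as a character vector
grp_f   <- factor(grp)
fit_fac <- lm(y ~ x + grp_f)    # same predictor as a factor

# Coefficient names differ (grpdog vs grp_fdog), so compare the values only
same_fit <- all.equal(unname(coef(fit_chr)), unname(coef(fit_fac)))
same_fit
```

If the two fits agree, same_fit is TRUE; the only visible difference in the summaries should be the coefficient labels.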

104

answered Sep 30 '22 15:09

N Brouwer

It all depends on what question you are asking of the data!

type.c <- c("debit", "debit", "credit")
type.f <- factor(type.c)

Here type.c is a character vector, whereas type.f is a factor: an integer vector with a "levels" attribute and class "factor" (neither is a list or an array in the R sense).

storage.mode(type.c)
# [1] "character"
storage.mode(type.f)
# [1] "integer"

When a factor is created, R looks through all of the values it has been given and creates the "levels" (sorted alphabetically by default). Have a peek at:

 levels(type.f)
 # [1] "credit" "debit"

Then, instead of storing the character strings "debit", "credit", "mis-spelt debbit" etc., it stores just an integer code for each element, along with the levels. Have a look at:

str(type.f)
# Factor w/ 2 levels "credit","debit": 2 2 1

i.e. type.c holds c("debit", "debit", "credit") and levels(type.f) gives "credit" "debit", so str(type.f) lists the first few values as they are stored, i.e. 2 2 1: each number is the position of that element's value in the levels.
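A small sketch of that internal representation, using only base R functions; the names codes and rebuilt are introduced here purely for illustration:

```r
type.c <- c("debit", "debit", "credit")
type.f <- factor(type.c)

# The stored integer codes: each one is a position in levels(type.f)
codes <- as.integer(type.f)

# Indexing the levels by the codes rebuilds the original strings
rebuilt <- levels(type.f)[codes]

# A factor is an integer vector with attributes, not a list
kind     <- typeof(type.f)
is_a_lst <- is.list(type.f)
attributes(type.f)   # shows the "levels" and "class" attributes
```

This is why storage.mode(type.f) reports "integer" even though the factor prints as text.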

If you mis-type "debbit" in type.c and then build the factor from it, levels(type.f) will show the typo as an extra level. Alternatively, table(type.c) will reveal it without creating a factor at all.
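A minimal sketch of spotting and repairing such a typo; the fourth value below is a deliberate mis-spelling added for illustration:

```r
type.c <- c("debit", "debit", "credit", "debbit")  # last value is a deliberate typo
type.f <- factor(type.c)

lv     <- levels(type.f)   # the typo appears as a level of its own
counts <- table(type.c)    # a count of 1 for "debbit" is a hint that it is a typo

# One way to repair it: rename the stray level, which merges it into "debit"
fixed <- type.f
levels(fixed)[levels(fixed) == "debbit"] <- "debit"
```

Renaming a level to the name of an existing one merges the two, so fixed ends up with only "credit" and "debit".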

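When there are only three elements, it doesn't make much difference to the storage volume, but as the vector gets longer the character version starts to take up more storage than the factor, which holds one 4-byte integer per element plus a single copy of the levels. (Strictly, R keeps only one copy of each distinct string and a character vector holds a pointer per element, typically 8 bytes on a 64-bit build, so the saving is per element rather than per character.) A little experiment shows that for a randomly selected set of type.c, the threshold on object.size(type.c)>object.size(type.f) is about 96 elements; the exact figure will vary with R version and platform.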

dc <- c("debit", "credit")
N <- 300

# Store the results as a matrix:
# col1 = n (number of elements)
# col2 = size in bytes of the character vector
# col3 = size in bytes of the factor
res <- matrix(NA_real_, ncol = 3, nrow = N)

for (i in seq_len(N)) {
  type.c <- sample(dc, i, replace = TRUE)
  type.f <- factor(type.c)
  res[i, 1] <- i
  res[i, 2] <- as.numeric(object.size(type.c))
  res[i, 3] <- as.numeric(object.size(type.f))
  cat('N=', i, '  object.size(type.c)=', res[i, 2], '  object.size(type.f)=', res[i, 3], '\n')
}

# Plot both sizes against n
plot(res[, 1], res[, 2], col = 'blue', type = 'l', xlab = 'Number of items in type.x', ylab = 'bytes of storage')
lines(res[, 1], res[, 3], col = 'red')
mtext('blue for character; red for factor')

# First n at which the character vector is larger than the factor
cat('Threshold at:', min(which(res[, 2] > res[, 3])), '\n')

Apologies for lack of R'ness as I thought it would help with clarity.

answered Sep 30 '22 13:09

Sean
