
sum(.) on a factor column returns incorrect result

Tags: r, data.table

I am in a strange fix here. I am using data.table for a very routine task, but there is something that I am not able to explain. I have figured out a way around the problem, but I think it is still important for me to understand what is going wrong here.

This code will bring the data into the workspace:

library(XML)
library(data.table)
theurl <- "http://goo.gl/hOKW3a"
tables <- readHTMLTable(theurl)
new.Res <- data.table(tables[[2]][4:5][-(1:2),])
suppressWarnings(names(new.Res) <- c("Party","Cases"))

There are two columns here, Party and Cases, both of which have the default class of factor, although Cases should be numeric. Ultimately, I just want the sum of Cases for each Party. So something like this should work:

new.Res[,sum(Cases), by=Party]

But this doesn't give the right answer. I thought it would work if I changed the class of Cases from factor to numeric. So I tried the following:

new.Res[,Cases := as.numeric(Cases)]
new.Res[,sum(Cases), by=Party]

But I got the same incorrect answer. I realized that the problem lies in the conversion of Cases from factor to numeric. So I tried a different method, and it worked:

Step 1: Reinitialize the data:

theurl <- "http://goo.gl/hOKW3a"
tables <- readHTMLTable(theurl)
new.Res <- data.table(tables[[2]][4:5][-(1:2),])
suppressWarnings(names(new.Res) <- c("Party","Cases"))

Step 2: Use a different method to change the class from factor to numeric:

new.Res[,Cases := strtoi(Cases)]
new.Res[,sum(Cases), by=Party]

This works fine! However, I am not sure what's wrong with the first two methods. What am I missing?

Shambho asked Feb 14 '23 03:02



1 Answer

The correct way to convert from factor to numeric or integer is to go through character. This is because, internally, a factor is a vector of integer codes that index into a levels vector. When you tell R to convert a factor to numeric, it simply returns those underlying codes; it does not try to parse the level labels as numbers.
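To make the difference concrete, here is a minimal sketch in base R. The `cases` vector is hypothetical toy data, not the table from the question; it only stands in for a numeric column that was read in as a factor.

```r
# Hypothetical toy data: numbers stored as a factor
cases <- factor(c("10", "200", "10", "35"))

levels(cases)      # the labels, sorted as strings: "10" "200" "35"
as.integer(cases)  # the internal codes (positions in the levels): 1 2 1 3

# as.numeric() on a factor returns the codes, so this sums 1 + 2 + 1 + 3
wrong <- sum(as.numeric(cases))

# Going through character parses the labels, so this sums 10 + 200 + 10 + 35
right <- sum(as.numeric(as.character(cases)))

# strtoi() coerces its argument with as.character() first, which is why
# the workaround in the question gives the expected numbers
via_strtoi <- sum(strtoi(cases))
```

Here `wrong` is 7 while `right` and `via_strtoi` are both 255. The incorrect total is not an error or a warning, just a sum over level positions, which is what makes this mistake easy to miss.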

Short answer: do Cases:=as.numeric(as.character(Cases)).

Edit: Alternatively, the ?factor help page recommends as.numeric(levels(Cases))[Cases] as slightly more efficient, since it converts each level only once rather than every element. h/t @Gsee in the comments.
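As a rough sketch of how the levels-based conversion could look inside a data.table, the example below uses a small made-up table `dt` in place of new.Res (the party names and counts are invented for illustration). It assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical stand-in for new.Res: both columns are factors
dt <- data.table(
  Party = factor(c("A", "B", "A", "B")),
  Cases = factor(c("10", "200", "35", "10"))
)

# Convert the levels once, then index the result by the factor itself;
# indexing by a factor uses its integer codes
dt[, Cases := as.numeric(levels(Cases))[Cases]]

# Grouped sum on the now-numeric column
res <- dt[, .(Total = sum(Cases)), by = Party]
```

With this toy data, `res` holds a total of 45 for party A and 210 for party B. Replacing the conversion line with `as.numeric(as.character(Cases))` should give the same totals.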

ilir answered Feb 15 '23 18:02
