R - From Factor to Numeric or Integer error



I have a dataframe in R that I loaded from a CSV file. One of the variables is called "Amount" and is meant to contain positive and negative numbers.

When I looked at the dataframe, this variable's data type was listed as a factor, and I need it in a numeric format (I'm not sure which kind, integer or numeric). So I tried converting it to each of those two types, but saw some interesting behavior.

Initial dataframe:


Amount        : Factor w/ 11837 levels "","-1","-10",..: 2 2 1664 4 6290 6290 6290 6290 6290 6290 ...

As I mentioned above, I saw something weird when I tried to convert it to either numeric or integer. To show this, I put together this comparison:

df2 <- data.frame(df$Amount, as.numeric(df$Amount), as.integer(df$Amount))

'data.frame':   2620276 obs. of  3 variables:
 $ df.Amount            : Factor w/ 11837 levels "","-1","-10",..: 2 2 1664 4 6290 6290 6290 6290 6290 6290 ...
 $ as.numeric.df.Amount.: num  2 2 1664 4 6290 ...
 $ as.integer.df.Amount.: int  2 2 1664 4 6290 6290 6290 6290 6290 6290 ...

> head(df2, 20)
         df.Amount        as.numeric.df.Amount.       as.integer.df.Amount.
1               -1                           2                           2
2               -1                           2                           2
3             -201                        1664                        1664
4             -100                           4                           4
5                1                        6290                        6290
6                1                        6290                        6290
7                1                        6290                        6290
8                1                        6290                        6290
9                1                        6290                        6290
10               1                        6290                        6290
11               1                        6290                        6290
12               1                        6290                        6290
13               1                        6290                        6290
14               1                        6290                        6290
15               1                        6290                        6290
16               1                        6290                        6290
17               1                        6290                        6290
18               2                        7520                        7520
19               2                        7520                        7520
20               2                        7520                        7520

The as.numeric and as.integer functions are taking the Amount variable and doing something to it, but I don't know what that is. My goal is to get the Amount variable into some sort of numeric data type so I can perform sum/mean/etc. on it.

What am I doing incorrectly that's causing the weird numbers, and what can I do to fix it?

1 Answer

The root of the problem is likely some funky value in your imported CSV. If it came from Excel, this is not uncommon. It can be a percent symbol, a "comment" character from Excel, or any of a long list of things. I would open the CSV in your editor of choice and look for values that are not plain numbers.

Aside from that, you have a few options.

read.csv takes an optional argument, stringsAsFactors, which you can set to FALSE so that text columns are read in as character vectors rather than factors.
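A minimal sketch of that option, assuming a comma-separated file with a header row. The inline text stands in for the real file and its values are made up for illustration; with an actual file you would pass its path instead of text =. (As of R 4.0.0, stringsAsFactors already defaults to FALSE.)

```r
# Hypothetical stand-in for the real CSV contents.
csv_text <- "Amount\n-1\n-201\n1\n2"

# Read without converting text columns to factors.
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(df$Amount)   # shows the type read.csv chose for the column
sum(df$Amount)
```

If the column still comes back as character rather than a number type, that suggests at least one value in it is not a plain number.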

A factor is stored as a vector of integer codes that index into its levels (the distinct values, kept as strings). When you convert directly with as.numeric, you wind up with those integer codes rather than the original values:

> x<-10:20
> as.numeric(factor(x))
 [1]  1  2  3  4  5  6  7  8  9 10 11
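The same effect explains the numbers in the question: the levels of a factor built from strings are sorted as text, so the code for "1" is just its position in that sorted list. A small sketch with made-up values resembling the Amount column:

```r
# Made-up values; the real column has 11837 distinct levels.
f <- factor(c("-1", "-201", "-100", "1", "2"))

levels(f)       # the distinct values, sorted as strings rather than as numbers
as.integer(f)   # positions within levels(f), not the amounts themselves

# Looking the codes up in the levels recovers the original strings.
levels(f)[as.integer(f)]
```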

If the data is already loaded as a factor, look at ?factor:

In particular, as.numeric applied to a factor is meaningless, and may happen by implicit coercion. To transform a factor f to approximately its original numeric values, as.numeric(levels(f))[f] is recommended and slightly more efficient than as.numeric(as.character(f)).
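Both forms from the help page can be sketched like this, using a made-up factor whose levels are all valid numbers:

```r
# Made-up values for illustration.
f <- factor(c("-1", "-201", "-100", "1", "2"))

# Convert each distinct level once, then index by the factor's integer codes.
amount <- as.numeric(levels(f))[f]

# Equivalent result, but converts every element rather than every level.
amount2 <- as.numeric(as.character(f))

sum(amount)
mean(amount)
```

With millions of rows and only a few thousand levels, as in the question, the first form has far fewer strings to parse.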

However, I suspect this will give a warning and leave you with NAs, because the input has something in it besides numbers.
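One way to see which values are causing trouble is to convert the levels and list the ones that fail to parse; as.numeric returns NA for strings it cannot read as numbers. This is only a sketch on made-up data, and the variable names are illustrative:

```r
# Made-up sample including an empty string, a percent sign and a thousands separator.
f <- factor(c("-1", "", "5%", "2", "1,000"))

lev <- levels(f)

# suppressWarnings hides the "NAs introduced by coercion" warning while probing.
bad <- lev[is.na(suppressWarnings(as.numeric(lev)))]
bad   # the levels that do not parse as numbers
```

Once you know what the offending values look like, you can decide whether to clean them in the source file or treat them as missing.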

