R - From Factor to Numeric or Integer error

Tags:

r

I have a dataframe in R that I loaded from a CSV file. One of the variables is called "Amount" and is meant to contain positive and negative numbers.

When I looked at the dataframe, this variable's datatype was listed as a factor, and I need it in a numeric format (I'm not sure which kind, integer or numeric). So I tried to convert it to each of those two formats but saw some interesting behavior.

Initial dataframe:

str(df)

Amount        : Factor w/ 11837 levels "","-1","-10",..: 2 2 1664 4 6290 6290 6290 6290 6290 6290 ...

As I mentioned above, I saw something weird when I tried to convert it to either numeric or integer. To show this, I put together this comparison:

df2 <- data.frame(df$Amount, as.numeric(df$Amount), as.integer(df$Amount))

str(df2)
'data.frame':   2620276 obs. of  3 variables:
 $ df.Amount            : Factor w/ 11837 levels "","-1","-10",..: 2 2 1664 4 6290 6290 6290 6290 6290 6290 ...
 $ as.numeric.df.Amount.: num  2 2 1664 4 6290 ...
 $ as.integer.df.Amount.: int  2 2 1664 4 6290 6290 6290 6290 6290 6290 ...

> head(df2, 20)
         df.Amount        as.numeric.df.Amount.       as.integer.df.Amount.
1               -1                           2                           2
2               -1                           2                           2
3             -201                        1664                        1664
4             -100                           4                           4
5                1                        6290                        6290
6                1                        6290                        6290
7                1                        6290                        6290
8                1                        6290                        6290
9                1                        6290                        6290
10               1                        6290                        6290
11               1                        6290                        6290
12               1                        6290                        6290
13               1                        6290                        6290
14               1                        6290                        6290
15               1                        6290                        6290
16               1                        6290                        6290
17               1                        6290                        6290
18               2                        7520                        7520
19               2                        7520                        7520
20               2                        7520                        7520

The as.numeric and as.integer functions are taking the Amount variable and doing something to it, but I don't know what that is. My goal is to get the Amount variable into some sort of number data type so I can perform sum/mean/etc. on it.

What am I doing incorrectly that's causing the weird numbers, and what can I do to fix it?

302

asked Feb 01 '12 19:02

mikebmassey

1 Answer

The root of the problem is likely some funky value in your imported CSV. If it came from Excel, this is not uncommon. It can be a percent symbol, a "comment" character from Excel, or any of a long list of things. I would open the CSV in your editor of choice and look for entries that are not plain numbers.

Aside from that, you have a few options.

read.csv takes an optional argument, stringsAsFactors, which you can set to FALSE so that text columns are read as character vectors rather than factors.
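As a minimal sketch of that option: the CSV text, the column names and the name df_demo below are made up for illustration and stand in for the real file. Note that since R 4.0.0 the default for stringsAsFactors is already FALSE.

```r
# Hypothetical stand-in for the real CSV; one Amount entry ("1%") is not a plain number.
csv_text <- "Amount,Note\n-1,a\n-201,b\n1%,c\n2,d"

# With stringsAsFactors = FALSE the text column stays character instead of factor.
df_demo <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(df_demo)

# as.numeric on a character vector parses the text itself (no level codes involved);
# the unparseable entry becomes NA, so the warning is suppressed here on purpose.
amounts <- suppressWarnings(as.numeric(df_demo$Amount))
print(amounts)
```

Because the column is character rather than factor, the conversion works on the text itself and the level-code problem from the question does not arise.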

A factor is stored as integer codes that index into its levels, and the levels hold the original values as strings. When you convert directly with as.numeric you wind up with those integer codes rather than the initial values:

> x <- 10:20
> as.numeric(factor(x))
 [1]  1  2  3  4  5  6  7  8  9 10 11

Otherwise, look at ?factor:

In particular, as.numeric applied to a factor is meaningless, and may happen by implicit coercion. To transform a factor f to approximately its original numeric values, as.numeric(levels(f))[f] is recommended and slightly more efficient than as.numeric(as.character(f)).
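A small sketch of the two conversions mentioned in the help text, using a made-up factor f whose values mirror a few of those in the question:

```r
# Made-up factor resembling the Amount column.
f <- factor(c("-1", "-201", "-100", "1", "2"))

# Direct conversion returns the internal integer codes, not the values.
codes <- as.numeric(f)

# Convert the levels once, then index them by the factor's codes.
recovered <- as.numeric(levels(f))[f]

# Equivalent result by going through character first.
via_char <- as.numeric(as.character(f))

print(codes)
print(recovered)
```

The levels-based form converts each distinct level only once, which is why the help page describes it as slightly more efficient on long vectors with few levels.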

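However, I suspect this will not convert cleanly because the input has something in it besides numbers: entries that cannot be parsed come back as NA, usually with a "NAs introduced by coercion" warning, rather than stopping with an error.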
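One way to track down the offending entries is to convert and then list the values that came back as NA. This is only a sketch on made-up data; the factor f and the names vals, parsed and bad are illustrative, not from the original data.

```r
# Made-up factor with a blank, a percent sign and a thousands separator.
f <- factor(c("-1", "", "1%", "2", "1,000"))

vals <- as.character(f)

# Unparseable strings become NA; the warning is suppressed because it is expected.
parsed <- suppressWarnings(as.numeric(vals))

# Distinct original strings that failed to convert.
bad <- unique(vals[is.na(parsed) & !is.na(vals)])
print(bad)
```

Once the problem strings are known, they can be fixed in the source file or cleaned (for example with gsub) before converting.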

115

answered Nov 15 '22 09:11

Justin
