In large numbers, commas are used to help the reader. A comma is placed every third digit to the left of the decimal point and so is used in numbers with four or more digits. Continue to place a comma after every third digit.
We can parse a number string with commas thousand separators into a number by removing the commas, and then use the + operator to do the conversion. We call replace with /,/g to match all commas and replace them all with empty strings.
To convert character to numeric in R, use the as. numeric() function. The as. numeric() is a built-in R function that creates or coerces objects of type “numeric”.
Not sure about how to have read.csv
interpret it properly, but you can use gsub
to replace ","
with ""
, and then convert the string to numeric
using as.numeric
:
y <- c("1,200","20,000","100","12,111")
as.numeric(gsub(",", "", y))
# [1] 1200 20000 100 12111
This was also answered previously on R-Help (and in Q2 here).
Alternatively, you can pre-process the file, for instance with sed
in unix.
You can have read.table or read.csv do this conversion for you semi-automatically. First create a new class definition, then create a conversion function and set it as an "as" method using the setAs function like so:
setClass("num.with.commas")
setAs("character", "num.with.commas",
function(from) as.numeric(gsub(",", "", from) ) )
Then run read.csv like:
DF <- read.csv('your.file.here',
colClasses=c('num.with.commas','factor','character','numeric','num.with.commas'))
I want to use R rather than pre-processing the data as it makes it easier when the data are revised. Following Shane's suggestion of using gsub
, I think this is about as neat as I can do:
x <- read.csv("file.csv",header=TRUE,colClasses="character")
col2cvt <- 15:41
x[,col2cvt] <- lapply(x[,col2cvt],function(x){as.numeric(gsub(",", "", x))})
This question is several years old, but I stumbled upon it, which means maybe others will.
The readr
library / package has some nice features to it. One of them is a nice way to interpret "messy" columns, like these.
library(readr)
read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5",
col_types = list(col_numeric())
)
This yields
Source: local data frame [4 x 1]
numbers
(dbl)
1 800.0
2 1800.0
3 3500.0
4 6.5
An important point when reading in files: you either have to pre-process, like the comment above regarding sed
, or you have to process while reading. Often, if you try to fix things after the fact, there are some dangerous assumptions made that are hard to find. (Which is why flat files are so evil in the first place.)
For instance, if I had not flagged the col_types
, I would have gotten this:
> read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5")
Source: local data frame [4 x 1]
numbers
(chr)
1 800
2 1,800
3 3500
4 6.5
(Notice that it is now a chr
(character
) instead of a numeric
.)
Or, more dangerously, if it were long enough and most of the early elements did not contain commas:
> set.seed(1)
> tmp <- as.character(sample(c(1:10), 100, replace=TRUE))
> tmp <- c(tmp, "1,003")
> tmp <- paste(tmp, collapse="\"\n\"")
(such that the last few elements look like:)
\"5\"\n\"9\"\n\"7\"\n\"1,003"
Then you'll find trouble reading that comma at all!
> tail(read_csv(tmp))
Source: local data frame [6 x 1]
3"
(dbl)
1 8.000
2 5.000
3 5.000
4 9.000
5 7.000
6 1.003
Warning message:
1 problems parsing literal data. See problems(...) for more details.
dplyr
solution using mutate_all
and pipessay you have the following:
> dft
Source: local data frame [11 x 5]
Bureau.Name Account.Code X2014 X2015 X2016
1 Senate 110 158,000 211,000 186,000
2 Senate 115 0 0 0
3 Senate 123 15,000 71,000 21,000
4 Senate 126 6,000 14,000 8,000
5 Senate 127 110,000 234,000 134,000
6 Senate 128 120,000 159,000 134,000
7 Senate 129 0 0 0
8 Senate 130 368,000 465,000 441,000
9 Senate 132 0 0 0
10 Senate 140 0 0 0
11 Senate 140 0 0 0
and want to remove commas from the year variables X2014-X2016, and convert them to numeric. also, let's say X2014-X2016 are read in as factors (default)
dft %>%
mutate_all(funs(as.character(.)), X2014:X2016) %>%
mutate_all(funs(gsub(",", "", .)), X2014:X2016) %>%
mutate_all(funs(as.numeric(.)), X2014:X2016)
mutate_all
applies the function(s) inside funs
to the specified columns
I did it sequentially, one function at a time (if you use multiple
functions inside funs
then you create additional, unnecessary columns)
"Preprocess" in R:
lines <- "www, rrr, 1,234, ttt \n rrr,zzz, 1,234,567,987, rrr"
Can use readLines
on a textConnection
. Then remove only the commas that are between digits:
gsub("([0-9]+)\\,([0-9])", "\\1\\2", lines)
## [1] "www, rrr, 1234, ttt \n rrr,zzz, 1234567987, rrr"
It's als useful to know but not directly relevant to this question that commas as decimal separators can be handled by read.csv2 (automagically) or read.table(with setting of the 'dec'-parameter).
Edit: Later I discovered how to use colClasses by designing a new class. See:
How to load df with 1000 separator in R as numeric class?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With