Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: How can I read a CSV file with data.table::fread, that has a comma as decimal and point as thousand separator="."

I got several CSV files which contain numbers in the local german style i.e. with a comma as the decimal separator and the point as the thousand separator e.g. 10.380,45. The values in the CSV file are separated by ";". The files also contain columns from the classes character, Date, Date & Time and Logical.

The problem with the read.table functions is, that you can specify the decimal separator with dec=",", but NOT the thousand point separator. (If I'm wrong, please correct me)

I know that preprocessing is a workaround, but I want to write my code in a way, that others can use it without me.

I found a way to read the CSV file the way I want it with read.csv2, by setting my own classes, as can be seen in the following example. Based on Most elegant way to load csv with point as thousands separator in R

# Create test example
df_test_write <- cbind.data.frame(c("a","b","c","d","e","f","g","h","i","j",rep("k",times=200)),
                            c("5.200,39","250,36","1.000.258,25","3,58","5,55","10.550,00","10.333,00","80,33","20.500.000,00","10,00",rep("3.133,33",times=200)),
                            c("25.03.2015","28.04.2015","03.05.2016","08.08.2016","08.08.2016","08.08.2016","08.08.2016","08.08.2016","08.08.2016","08.08.2016",rep("08.08.2016",times=200)),
                            stringsAsFactors=FALSE)
colnames(df_test_write) <- c("col_text","col_num","col_date")

# write test csv
write.csv2(df_test_write,file="Test.csv",quote=FALSE,row.names=FALSE)

#### read with read.csv2 ####

# First, define your own class

#define your own numeric class
setClass('myNum')
#define conversion
setAs("character","myNum", function(from) as.numeric(gsub(",","\\.",gsub("\\.","",from))))

# own date class
library(lubridate)
setClass('myDate')
setAs("character","myDate",function(from) dmy(from))

# Read the csv file, in colClasses the columns class can be defined
df_test_readcsv <- read.csv2(paste0(getwd(),"/Test.csv"),
                       stringsAsFactors = FALSE,
                       colClasses = c(
                         col_text = "character",
                         col_num = "myNum",
                         col_date = "myDate"
                       )
                )

My problem now is, that the different datasets have up to 200 columns and 350000 Rows. With the upper solution I need between 40 and 60 seconds to load one CSV file and I would like to speed this up.

Through my research I found fread() from the data.table package, which is really fast. It takes approximately 3 to 5 seconds to load the CSV file.

Unfortunately there is also no possibility to specify the thousand separator. So I tried to use my solution with colClasses, but there seems to be the issue, that you can't use individual classes with fread https://github.com/Rdatatable/data.table/issues/491

See also my following test code:

##### read with fread ####
library(data.table)

# Test without colclasses
df_test_readfread1 <- fread(paste0(getwd(),"/Test.csv"),
                            stringsAsFactors = FALSE,
                            dec = ",",
                            sep=";",
                            verbose=TRUE)
str(df_test_readfread1)

# PROBLEM: In my real dataset it turns the number into an numeric column, 
# unforunately it sees the "." as decimal separator, so it turns e.g. 10.550, 
# into 10.5
# Here it keeps everything as character

# Test with colclasses
df_test_readfread2 <- fread(paste0(getwd(),"/Test.csv"),
                            stringsAsFactors = FALSE,
                            colClasses = c(
                              col_text = "character",
                              col_num = "myNum",
                              col_date = "myDate"
                            ),
                            sep=";",
                            verbose=TRUE)
str(df_test_readfread2)

# Keeps everything as character

So my question is: Is there a way to read CSV files with numeric values like 10.380,45 with fread?

(Alternatively: What is the fastest way to read a CSV with such numeric values?)

like image 338
PhiSeu Avatar asked Oct 19 '22 03:10

PhiSeu


1 Answers

I never used package myself, but it's from Hadley Wickham, should be good stuff

https://cran.r-project.org/web/packages/readr/readr.pdf

It supposed to handle locales:

locale(date_names = "en", date_format = "%AD", time_format = "%AT", decimal_mark = ".", grouping_mark = ",", tz = "UTC", encoding = "UTF-8", asciify = FALSE)

decimal_mark and grouping_mark is what you're looking for

EDIT form PhiSeu: Solution

Thanks to your suggestion here are two solutions with read_csv2() from the readr package. For my 350000 row CSV file it takes approximately 8 seconds, which is much faster then the read.csv2 solution. (Another helpful package from hadley and RStudio, thanks)

library(readr)

# solution 1 with specified columns
df_test_readr <- read_csv2(paste0(getwd(),"/Test.csv"),
                           locale = locale("de"),
                           col_names = TRUE,
                           cols(
                             col_text = col_character(),
                             col_num = col_number(), # number is automatically regcognized through locale=("de")
                             col_date2 = col_date(format ="%d.%m.%Y") # Date specification
                           )
                           )

# solution 2 with overall definition of date format
df_test_readr <- read_csv2(paste0(getwd(),"/Test.csv"),
                           locale = locale("de",date_format = "%d.%m.%Y"), # specifies the date format for the whole file
                           col_names = TRUE
)
like image 194
Severin Pappadeux Avatar answered Nov 03 '22 06:11

Severin Pappadeux