
R data.table fread command : how to read large files with irregular separators?

I have to work with a collection of 120 files of ~2 GB each (525600 lines x 302 columns). The goal is to compute some statistics and store the results in a clean SQLite database.

Everything works fine when my script imports the data with read.table(), but it's slow. So I tried fread from the data.table package (version 1.9.2), but it gives me this error:

Error in fread(txt, header = T, select = c("YYY", "MM", "DD",  : 
Not positioned correctly after testing format of header row. ch=' '

The first 2 rows and 7 columns of my data look like this:

 YYYY MM DD HH mm             19490             40790
 1991 10  1  1  0      1.046465E+00      1.568405E+00

So there is a leading space at the beginning of each line, then exactly one space between the date columns, then an arbitrary number of spaces between the other columns.

I've tried a command like this to convert the spaces to commas:

DT <- fread(
            paste("sed 's/\\s\\+/,/g'", txt),
            header=TRUE,
            select=c('YYYY','MM','DD','HH')
)

Without success: the problem remains, and the sed command seems to make the import slow.

fread doesn't seem to like an arbitrary number of spaces as a separator, or an empty column at the beginning of the line. Any ideas?

Here is a (hopefully) minimal reproducible example (there is a newline character after 40790):

# Two-line sample: a header row and one data row, with a newline after 40790.
txt <- " YYYY MM DD HH mm             19490             40790
 1991 10  1  1  0      1.046465E+00      1.568405E+00"

testDT <- fread(txt,
                header=TRUE,
                select=c("YYYY","MM","DD","HH")
)

Thanks for your help !

UPDATE: The error doesn't occur with data.table 1.8.*. With that version, the table is read as one single line, which is no better.

UPDATE 2: As mentioned in the comments, I could use sed to format the table and then read it with fread. I've put a script in a separate answer where I create a sample dataset and then compare some system.time() results.

fxi asked Dec 05 '22 07:12



2 Answers

Just committed to devel, v1.9.5. fread() gains a strip.white argument that defaults to TRUE (unlike base::read.table(), where it defaults to FALSE), because stripping is the more desirable behaviour. The example data has now been added to the tests.

With this recent commit:

require(data.table) # v1.9.5, commit 0e7a835 or more recent
ans <- fread(" YYYY MM DD HH mm             19490             40790\n   1991 10  1  1  0      1.046465E+00      1.568405E+00")
#      V1 V2 V3 V4 V5           V6           V7
# 1: YYYY MM DD HH mm 19490.000000 40790.000000
# 2: 1991 10  1  1  0     1.046465     1.568405
sapply(ans, class)
#          V1          V2          V3          V4          V5          V6          V7 
# "character" "character" "character" "character" "character"   "numeric"   "numeric" 
Arun answered May 15 '23 04:05



sed 's/^[[:blank:]]*//;s/[[:blank:]]\{1,\}/,/g' 

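Use the expression above for your sed command: it strips the leading blanks, then replaces each run of one or more blanks with a single comma.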
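As a quick check, the expression can be run on the two sample lines from the question. This is only a sketch: the temporary file and the variable names `sample` and `converted` are made up for illustration, and a POSIX-compatible sed is assumed.

```shell
# Hypothetical temporary file holding the two sample lines from the question.
sample=$(mktemp)
printf '%s\n' \
  ' YYYY MM DD HH mm             19490             40790' \
  ' 1991 10  1  1  0      1.046465E+00      1.568405E+00' > "$sample"

# Strip leading blanks, then replace each run of blanks with one comma.
converted=$(sed 's/^[[:blank:]]*//;s/[[:blank:]]\{1,\}/,/g' "$sample")
printf '%s\n' "$converted"
# YYYY,MM,DD,HH,mm,19490,40790
# 1991,10,1,1,0,1.046465E+00,1.568405E+00

rm -f "$sample"
```

The result has no empty leading column and exactly one comma between fields, so it can be read as an ordinary comma-separated file.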

Is it not possible to collect all the input for fread into one (temporary) file (adding a reference to the source file) and process that file with sed (or another tool) in a single pass, to avoid forking the tool at every iteration?
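One possible reading of that suggestion, sketched with awk rather than sed so that a single process handles every file: the function name `combine_files`, the `source` column name and the output path are assumptions for illustration, not something from the question. It also assumes every file has the same header row, and it only keeps the first one.

```shell
# combine_files OUT FILE...  (hypothetical helper)
# Writes one comma-separated file with the source file name as first column.
combine_files() {
  out=$1
  shift
  awk -v OFS=',' '
    FNR == 1 && NR > 1 { next }            # skip the header of every file but the first
    { $1 = $1 }                            # re-split on runs of blanks, rejoin with commas
    NR == 1 { print "source", $0; next }   # header row gets a name for the new column
    { print FILENAME, $0 }                 # data rows are tagged with their source file
  ' "$@" > "$out"
}
```

A call such as `combine_files all.csv file1.txt file2.txt` (file names hypothetical) would then leave a single CSV that can be read without any per-file sed call. With 120 files of ~2 GB each the combined file would be very large, so combining in smaller batches may be more practical.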

NeronLeVelu answered May 15 '23 04:05
