R data.table fread command : how to read large files with irregular separators?

Question

I have to work with a collection of 120 files of ~2 GB each (525600 lines x 302 columns). The goal is to compute some statistics and store the results in a clean SQLite database.

Everything works fine when my script imports the files with read.table(), but it's slow. So I tried fread, from the data.table package (version 1.9.2), but it gives me this error:

Error in fread(txt, header = T, select = c("YYY", "MM", "DD",  : 
Not positioned correctly after testing format of header row. ch=' '

The first 2 rows and 7 columns of my data look like this:

 YYYY MM DD HH mm             19490             40790
 1991 10  1  1  0      1.046465E+00      1.568405E+00

So there is a leading space at the beginning of each line, then a single space between the date columns, then an arbitrary number of spaces between the other columns.

I tried a command like this to convert the runs of spaces to commas:

DT <- fread(
    paste("sed 's/\\s\\+/,/g'", txt),
    header = TRUE,
    select = c('YYYY', 'MM', 'DD', 'HH')
)

This did not work: the problem remains, and reading seems slow with the sed command.

fread doesn't seem to like an arbitrary number of spaces as the separator, or an empty column at the beginning. Any ideas?

Here is a minimal reproducible example (there is a newline character after 40790):

txt <- " YYYY MM DD HH mm             19490             40790
 1991 10  1  1  0      1.046465E+00      1.568405E+00"

testDT <- fread(txt,
                header = TRUE,
                select = c("YYY", "MM", "DD", "HH")
)

Thanks for your help !

UPDATE: The error doesn't occur with data.table 1.8.*. With that version, the table is read as a single line, which is no better.

UPDATE 2: As mentioned in the comments, I could use sed to reformat the table and then read it with fread. I've posted a script in an answer where I create a sample dataset and then compare some system.time() results.

Arun · Accepted Answer

Just committed to devel, v1.9.5. fread() gains a strip.white argument with default TRUE (unlike base::read.table(), where the default is FALSE, because stripping is more desirable). The example data has been added to the tests.

With this recent commit:

require(data.table) # v1.9.5, commit 0e7a835 or more recent
ans <- fread(" YYYY MM DD HH mm             19490             40790
   1991 10  1  1  0      1.046465E+00      1.568405E+00")
#      V1 V2 V3 V4 V5           V6           V7
# 1: YYYY MM DD HH mm 19490.000000 40790.000000
# 2: 1991 10  1  1  0     1.046465     1.568405
sapply(ans, class)
#          V1          V2          V3          V4          V5          V6          V7 
# "character" "character" "character" "character" "character"   "numeric"   "numeric"
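
As the output above shows, the header row ends up as data here because the last two names look numeric. A minimal sketch of one way to get named columns, assuming a current CRAN release of data.table in which fread accepts text=, header=, sep= and select= (the variable names txt and dt are only for illustration):

```r
library(data.table)  # assumes a release that includes strip.white

# Sample from the question: a leading space, single spaces between the date
# fields, and runs of spaces before the numeric columns.
txt <- " YYYY MM DD HH mm             19490             40790
 1991 10  1  1  0      1.046465E+00      1.568405E+00"

# header = TRUE forces the first row to be used as column names, even though
# the names 19490 and 40790 look like numbers.
# sep = " " makes the separator explicit instead of relying on auto-detection.
dt <- fread(text = txt, header = TRUE, sep = " ",
            select = c("YYYY", "MM", "DD", "HH"))
print(dt)
```

For a file on disk, the same call should work with the file path as the first argument instead of text=.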

NeronLeVelu · Answer

sed 's/^[[:blank:]]*//;s/[[:blank:]]\{1,\}/,/g'

for your sed command.

Is it not possible to collect all the results of fread into one (temporary) file (adding a source reference) and process that file with sed (or another tool), to avoid forking the tool at every iteration?
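
To illustrate how the sed expression could be wired into fread, here is a rough sketch. It assumes a Unix-like system with sed on the PATH and a data.table release in which fread has a cmd= argument; tmp, sed_prog and cmd are hypothetical names, and the temporary file only stands in for one of the real data files.

```r
library(data.table)

# Write the two sample rows from the question to a temporary file.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  " YYYY MM DD HH mm             19490             40790",
  " 1991 10  1  1  0      1.046465E+00      1.568405E+00"
), tmp)

# The sed program from this answer: drop leading blanks, then turn each run
# of blanks into one comma. Backslashes are doubled because this is an R string.
sed_prog <- "s/^[[:blank:]]*//;s/[[:blank:]]\\{1,\\}/,/g"

# shQuote() protects both the sed program and the file path from the shell.
cmd <- paste("sed", shQuote(sed_prog), shQuote(tmp))

# fread runs the command and parses its output as comma-separated data.
dt <- fread(cmd = cmd, header = TRUE, sep = ",")
print(dt)
```

Each call still starts one sed process per file, which is the cost the paragraph above suggests avoiding.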

Tags: r, sed, data.table, wc, read.table, fxi
