I have to work with a collection of 120 files of ~2 GB each (525600 lines x 302 columns). The goal is to compute some statistics and put the results in a clean SQLite database.
Everything works fine when my script imports the data with read.table(), but it's slow. So I tried fread, from the data.table package (version 1.9.2), but it gives me this error:
Error in fread(txt, header = T, select = c("YYY", "MM", "DD", :
Not positioned correctly after testing format of header row. ch=' '
The first 2 rows and 7 columns of my data look like this:
YYYY MM DD HH mm 19490 40790
1991 10 1 1 0 1.046465E+00 1.568405E+00
So, there is a leading space at the beginning of each line, then exactly one space between the date columns, then an arbitrary number of spaces between the other columns.
I've tried a command like this to convert the spaces into commas:
DT <- fread(
paste("sed 's/\\s\\+/,/g'", txt),
header=T,
select=c('YYYY','MM','DD','HH')
)
without success: the problem remains, and reading through the sed command seems slow.
fread doesn't seem to like an "arbitrary number of spaces" as a separator, or an empty column at the beginning. Any ideas?
Here is a (hopefully) minimal reproducible example (there is a newline character after 40790):
txt<-print(" YYYY MM DD HH mm 19490 40790
1991 10 1 1 0 1.046465E+00 1.568405E+00")
testDT<-fread(txt,
header=T,
select=c("YYYY","MM","DD","HH")
)
Thanks for your help !
UPDATE: The error doesn't occur with data.table 1.8.*. With that version, the table is read as a single line, which is no better.
UPDATE 2: As mentioned in the comments, I could use sed to format the table and then read it with fread. I've put a script in an answer where I create a sample dataset and then compare some system.time() results.
Just committed to devel, v1.9.5. fread() gains a strip.white argument with default TRUE (as opposed to base::read.table(), because it's more desirable). The example data is now added to tests.
With this recent commit:
require(data.table) # v1.9.5, commit 0e7a835 or more recent
ans <- fread(" YYYY MM DD HH mm 19490 40790\n 1991 10 1 1 0 1.046465E+00 1.568405E+00")
# V1 V2 V3 V4 V5 V6 V7
# 1: YYYY MM DD HH mm 19490.000000 40790.000000
# 2: 1991 10 1 1 0 1.046465 1.568405
sapply(ans, class)
# V1 V2 V3 V4 V5 V6 V7
# "character" "character" "character" "character" "character" "numeric" "numeric"
For your sed command, use: sed 's/^[[:blank:]]*//;s/[[:blank:]]\{1,\}/,/g'
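To see what this sed expression does, here is a minimal sketch run on two lines shaped like the sample in the question. The exact number of spaces between the last columns is an assumption, and the variable names `sample` and `cleaned` are only for illustration. Unlike the `s/\s\+/,/g` attempt in the question (which, at least under GNU sed, turns the leading space into a leading comma and so still produces an empty first column), the first expression here strips the leading blanks before the second one converts each remaining run of blanks into a single comma.

```shell
# Two sample lines: a leading space, single spaces between the date
# columns, and several spaces between the remaining columns (assumed count).
sample=' YYYY MM DD HH mm   19490   40790
 1991 10 1 1 0   1.046465E+00   1.568405E+00'

# Expression 1 removes leading blanks (spaces or tabs).
# Expression 2 replaces every remaining run of blanks with one comma.
cleaned=$(printf '%s\n' "$sample" | sed 's/^[[:blank:]]*//;s/[[:blank:]]\{1,\}/,/g')

printf '%s\n' "$cleaned"
# YYYY,MM,DD,HH,mm,19490,40790
# 1991,10,1,1,0,1.046465E+00,1.568405E+00
```

The result is plain comma-separated text with no empty first column, which could then be written to a file or passed on to fread.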
Isn't it possible to collect all the input for fread into one (temporary) file (adding the source reference) and process that file with sed (or another tool), to avoid forking the tool at every iteration?
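One possible reading of this suggestion, sketched with awk rather than sed so that a single process handles every file: each data line is normalised to comma-separated form and prefixed with the name of the file it came from, and only the first file's header is kept. The file names `a.txt` and `b.txt`, the `source` column name, and the values in the second file are all made up for illustration.

```shell
# Create two tiny stand-in files in a temporary directory (hypothetical data).
workdir=$(mktemp -d)
printf ' YYYY MM DD HH mm   19490   40790\n 1991 10 1 1 0   1.046465E+00   1.568405E+00\n' > "$workdir/a.txt"
printf ' YYYY MM DD HH mm   19490   40790\n 1992 11 2 3 0   2.000000E+00   3.000000E+00\n' > "$workdir/b.txt"

# One awk process reads all files in turn.
combined=$(cd "$workdir" && awk -v OFS=, '
  FNR == 1 && NR > 1 { next }           # skip the header of every file after the first
  { $1 = $1 }                           # re-split on blanks and rejoin the fields with commas
  NR == 1 { print "source", $0; next }  # header line: prepend a name for the new column
  { print FILENAME, $0 }                # data line: prepend the source file name
' a.txt b.txt)

printf '%s\n' "$combined"
rm -r "$workdir"
```

Whether one combined file is practical for 120 files of ~2 GB each depends on disk space and on how the statistics are computed afterwards, so this is only a sketch of the idea.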