I have a 5 GB .dat file (more than 10 million lines). Each line looks like aaaa bb cccc0123 xxx kkkkkkkkkkkkkk or aaaaabbbcccc01234xxxkkkkkkkkkkkkkk, for example. Because readLines() performs poorly on big files, I chose fread() to read it, but an error occurred:
library("data.table")
x <- fread("test.DAT")
Error in fread("test.DAT") :
Expecting 5 cols, but line 5 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=' ' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.
In addition: Warning message:
In fread("test.DAT") :
Unable to find 5 lines with expected number of columns (+ middle)
How can I use fread() like readLines(), without automatic column detection? Or is there another way to solve this problem?
Here's a trick. You can use a sep value that you know is not in the file. That forces fread() to read each whole line as a single column. Then we can drop that column to an atomic vector (shown as [[1L]] below). Here's an example on a CSV file where I use ? as the sep. This way it behaves like readLines(), only a lot faster.
f <- fread("Batting.csv", sep= "?", header = FALSE)[[1L]]
head(f)
# [1] "playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP"
# [2] "abercda01,1871,1,TRO,NA,1,4,0,0,0,0,0,0,0,0,0,0,,,,,"
# [3] "addybo01,1871,1,RC1,NA,25,118,30,32,6,0,0,13,8,1,4,0,,,,,"
# [4] "allisar01,1871,1,CL1,NA,29,137,28,40,4,5,0,19,3,1,2,5,,,,,"
# [5] "allisdo01,1871,1,WS3,NA,27,133,28,44,10,2,2,27,1,1,0,2,,,,,"
# [6] "ansonca01,1871,1,RC1,NA,25,120,29,39,11,3,0,16,6,2,2,1,,,,,"
Other uncommon characters you can try as sep are \ ^ @ # = and others. We can see that this produces the same output as readLines(). It's just a matter of finding a sep value that is not present in the file.
head(readLines("Batting.csv"))
# [1] "playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP"
# [2] "abercda01,1871,1,TRO,NA,1,4,0,0,0,0,0,0,0,0,0,0,,,,,"
# [3] "addybo01,1871,1,RC1,NA,25,118,30,32,6,0,0,13,8,1,4,0,,,,,"
# [4] "allisar01,1871,1,CL1,NA,29,137,28,40,4,5,0,19,3,1,2,5,,,,,"
# [5] "allisdo01,1871,1,WS3,NA,27,133,28,44,10,2,2,27,1,1,0,2,,,,,"
# [6] "ansonca01,1871,1,RC1,NA,25,120,29,39,11,3,0,16,6,2,2,1,,,,,"
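Finding a sep that is not in the file can be checked on a sample before committing to the full read. The following is only a rough sketch: pick_unused_sep() is a hypothetical helper (not part of data.table), the candidate list and the sample size are assumptions, and because only the first n lines are inspected it cannot guarantee that the character is absent from the rest of the file.

```r
library(data.table)

# Hypothetical helper: return the first candidate separator that does not
# occur in the first `n` lines of `path`. Only a sample is checked, so this
# is a heuristic, not a guarantee for the whole file.
pick_unused_sep <- function(path,
                            candidates = c("?", "^", "@", "#", "\\"),
                            n = 10000L) {
  sample_lines <- readLines(path, n = n, warn = FALSE)
  for (ch in candidates) {
    # fixed = TRUE matches `ch` literally instead of as a regular expression
    if (!any(grepl(ch, sample_lines, fixed = TRUE))) return(ch)
  }
  stop("every candidate separator occurs in the sampled lines")
}

# Small temporary file standing in for the real data
tmp <- tempfile(fileext = ".dat")
writeLines(c("aaaa bb cccc0123 xxx kkkk", "a?b c", "x y"), tmp)

sep_char <- pick_unused_sep(tmp)   # "?" occurs in the file, so it is skipped
lines <- fread(tmp, sep = sep_char, header = FALSE)[[1L]]
```

If fread() still reports more than one column on the full file, the chosen character occurs somewhere past the sampled lines and another candidate is needed.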
Note: As @Cath mentioned in the comments, you could also simply use the line break character \n as the sep value.
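As a small illustration of that variant, here is a sketch on a made-up temporary file whose lines mimic the ragged format from the question (the file contents are an assumption for demonstration only). How sep = "\n" is handled may depend on the installed data.table version, so it is worth comparing against readLines() on a small sample first.

```r
library(data.table)

# Stand-in for the .dat file from the question: lines with differing numbers
# of space-separated fields, which is what trips up column auto-detection.
tmp <- tempfile(fileext = ".dat")
writeLines(c("aaaa bb cccc0123 xxx kkkkkkkkkkkkkk",
             "aaaaabbbcccc01234xxxkkkkkkkkkkkkkk",
             "aaaa bb cccc0123 xxx kkkkkkkkkkkkkk extra text"), tmp)

# sep = "\n" asks fread() to treat each whole line as a single field
via_fread <- fread(tmp, sep = "\n", header = FALSE)[[1L]]

# Reference result from base R for comparison
via_readLines <- readLines(tmp)
```

Comparing via_fread with via_readLines (for example with identical()) on a sample like this is a cheap sanity check before running the same call on the 5 GB file.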