
How to use fread() as readLines() without auto column detection?

Tags: r, data.table

I have a 5 GB .dat file (more than 10 million lines). Each line looks like aaaa bb cccc0123 xxx kkkkkkkkkkkkkk or aaaaabbbcccc01234xxxkkkkkkkkkkkkkk, for example. Because readLines() performs poorly on big files, I chose fread() to read it, but an error occurred:

library("data.table")
x <- fread("test.DAT")
Error in fread("test.DAT") : 
  Expecting 5 cols, but line 5 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=' ' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.
In addition: Warning message:
In fread("test.DAT") :
  Unable to find 5 lines with expected number of columns (+ middle)

How can I use fread() like readLines(), without automatic column detection? Or is there another way to solve this problem?

Eric Chang asked Oct 03 '15 07:10


1 Answer

Here's a trick: use a sep value that you know does not occur in the file. That forces fread() to read each whole line as a single column. You can then extract that column as an atomic character vector (the [[1L]] below). Here's an example on a CSV file where I use ? as the sep. This way fread() behaves much like readLines(), only a lot faster.

f <- fread("Batting.csv", sep= "?", header = FALSE)[[1L]]
head(f)
# [1] "playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP"
# [2] "abercda01,1871,1,TRO,NA,1,4,0,0,0,0,0,0,0,0,0,0,,,,,"       
# [3] "addybo01,1871,1,RC1,NA,25,118,30,32,6,0,0,13,8,1,4,0,,,,,"  
# [4] "allisar01,1871,1,CL1,NA,29,137,28,40,4,5,0,19,3,1,2,5,,,,," 
# [5] "allisdo01,1871,1,WS3,NA,27,133,28,44,10,2,2,27,1,1,0,2,,,,,"
# [6] "ansonca01,1871,1,RC1,NA,25,120,29,39,11,3,0,16,6,2,2,1,,,,,"

Other uncommon characters you can try as sep include \ ^ @ # and =. As the readLines() call below shows, the result is the same as what readLines() returns. It's just a matter of finding a sep value that is not present in the file.

head(readLines("Batting.csv"))
# [1] "playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP"
# [2] "abercda01,1871,1,TRO,NA,1,4,0,0,0,0,0,0,0,0,0,0,,,,,"                                  
# [3] "addybo01,1871,1,RC1,NA,25,118,30,32,6,0,0,13,8,1,4,0,,,,,"                             
# [4] "allisar01,1871,1,CL1,NA,29,137,28,40,4,5,0,19,3,1,2,5,,,,,"                            
# [5] "allisdo01,1871,1,WS3,NA,27,133,28,44,10,2,2,27,1,1,0,2,,,,,"                           
# [6] "ansonca01,1871,1,RC1,NA,25,120,29,39,11,3,0,16,6,2,2,1,,,,," 
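If you are not sure which character is safe, you can check a sample of the file first. The following is only a minimal sketch: find_unused_sep() is a hypothetical helper (not part of data.table), the candidate list and the sample size n are assumptions, and the small CSV is made up for illustration. It inspects only the first n lines, so it cannot guarantee that the character is absent from the rest of a large file.

```r
# Hypothetical helper: return the first candidate character that does not
# appear in the first `n` lines of the file (a sample, not a guarantee).
find_unused_sep <- function(path,
                            candidates = c("?", "^", "@", "#", "="),
                            n = 1000L) {
  sample_lines <- readLines(path, n = n)
  for (ch in candidates) {
    # fixed = TRUE matches the character literally, not as a regex
    if (!any(grepl(ch, sample_lines, fixed = TRUE))) {
      return(ch)
    }
  }
  NA_character_  # every candidate occurs somewhere in the sample
}

# Made-up example file in which "?" is already used by the data
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,question", "1,why?", "2,how?"), tmp)

sep_char <- find_unused_sep(tmp)
print(sep_char)
```

The returned character can then be passed as the sep argument to fread(), as in the example above.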

Note: As @Cath mentioned in the comments, you can also simply use the line break character \n as the sep value, which avoids having to find a character that is absent from the file.
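A minimal sketch of the \n variant, using a made-up temporary file with lines shaped like those in the question. The extra arguments (colClasses, quote, strip.white) are my own additions rather than part of the original suggestion. They are existing fread() parameters, set here on the assumption that you want each line back untouched as text, with no type conversion, quote handling, or whitespace trimming.

```r
library(data.table)

# Made-up sample file with lines shaped like those in the question
tmp <- tempfile(fileext = ".dat")
writeLines(c("aaaa bb cccc0123 xxx kkkkkkkkkkkkkk",
             "aaaaabbbcccc01234xxxkkkkkkkkkkkkkk",
             "aaaa bb cccc0123 xxx kkkkkkkkkkkkkk extra text"), tmp)

# sep = "\n" tells fread() there is no column separator, so each line
# becomes one value in a single column; [[1L]] extracts that column
lines_fread <- fread(tmp,
                     sep = "\n",
                     header = FALSE,
                     colClasses = "character",  # do not guess numeric types
                     quote = "",                # treat quote characters literally
                     strip.white = FALSE        # keep leading/trailing spaces
                     )[[1L]]

# Base R equivalent, for comparison
lines_base <- readLines(tmp)

print(lines_fread)
```

On a file like the one in the question, each element of lines_fread is one raw line, which can then be split into fields with substr() or strsplit() as needed.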

Rich Scriven answered Oct 13 '22 02:10