Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Bypassing "ghost" line break or end of file (EOF) in data.table::fread

I'm loading several large, tab-delimited text files exported from an (accessible to me) database into R using data.table::fread. fread handles most of the files with great ease and speed, but one of the files is generating a regularly-reported fread error:

Error in fread(read_problem, encoding = "UTF-8", na.strings = "", header = TRUE,  : 
                   Expected sep ('  ') but new line or EOF ends field ...

A smaller (2000 rows) version of the file containing the offending line is available here (RDS file).

Here's how I've tried to diagnose the problem to this point:

library(data.table) # I'm using 1.9.7 development (same error with 1.9.6)
read_problem <- readRDS("read_problem.rds")
error <- fread(read_problem, encoding = "UTF-8", na.strings = "",
               header = TRUE, sep = "\t", 
               colClasses = rep("character", 44),  # For simplicity
               verbose = TRUE)

If I excise the offending line, the problem disappears:

cat(read_problem, file = "temp")   
string_vec <- readLines("temp")
clipped_vec <- string_vec[-1027] # Get rid of problem line 1027
restored <- paste(clipped_vec, collapse = "\n")
noerror <- fread(restored, encoding = "UTF-8", na.strings = "",
                 header = TRUE, sep = "\t", 
                 colClasses = rep("character", 44)) # For simplicity
class(noerror)
[1] "data.table" "data.frame"

dim(noerror) 
[1] 1999   44

The error message seems clear enough: fread is looking for a "\t" but is finding something else in its place.

But I find nothing obvious from a closer look at the offending line relative to those around it.

The number of tab characters is the same

sapply(gregexpr("\t", string_vec[1026:1028]), length)
[1] 43 43 43

Line break information seems identical

unlist(gregexpr("\n", string_vec[1026:1028]))
[1] -1 -1 -1

Here's a look at the offending line itself as a string:

string_vec[1027]
[1] "URN:CornellLabOfOrnithology:EBIRD:OBS132960387\t29816\tspecies\tNelson's Sparrow\tAmmodramus nelsoni\t\t\t1\t\t\tUnited States\tUS\tGeorgia\tUS-GA\tGlynn\tUS-GA-127\tUS-GA_3181\t\t\tJekyll Island\tL140461\tH\t31.0464993\t-81.4113007\t1990-11-03\t13:15:00\t\"Jekyll Island and Causeway. Partly cloudy, mild, NE wind 8-15 mph. Note: Did very little birding in upland habitats as time available was rather brief.\"  Data entered on behalf of Paul Sykes by Alison Huff ([email protected]) on 12-15-11.\tListed on old Georgia Field Checklist as \"Sparrow, Sharp-tailed.\"\tobsr289931\tPaul\tSykes\tS9336358\teBird - Traveling Count\tEBIRD\t270\t8.047\t\t1\t1\t\t1\t0\t\t"

Any advice to get around this problem without the manual extraction of offending lines?

like image 989
adamdsmith Avatar asked Sep 30 '15 16:09

adamdsmith


1 Answers

With with this commit, this is now fixed in v1.9.7, current development version. The next stable release should therefore be able to read this properly using quote="".

require(data.table) #v1.9.7+
fread('"abcd efgh." ijkl.\tmnop "qrst uvwx."\t45\n', quote="")
#                    V1                V2 V3
# 1: "abcd efgh." ijkl. mnop "qrst uvwx." 45

On the 1027th line, at the end of "Sparrow, Sharp-tailed." there's only one tab. Where as in the other lines, after that field, there are two before the "obsr[0-9]" field starts.

The number of tabs seem to match because, on line 1027, there's a tab before "Listed on old Georgia Field" instead of a space..

Therefore line 1027 gets only 43 cols instead of 44. This seems to be the issue.


Looking at this again, it seems like Listed on old Georgia Field Checklist as "Sparrow, Sharp-tailed." should be read as a separate column but instead is being read together with the previous column...

Here's a smaller reproducible example:

# note that there are only 2 instead of 3 columns
fread('"abcd efgh." ijkl.\tmnop "qrst uvwx."\t45\n')
#                                     V1 V2
# 1: abcd efgh." ijkl.\tmnop "qrst uvwx. 45

# add a header column and it returns the same error
fread('a\tb\tc\n"abcd efgh." ijkl.\tmnop "qrst uvwx."\t45\n')
# Error in fread("a\tb\tc\n\"abcd efgh.\" ijkl.\tmnop \"qrst uvwx.\"\t45\n") : 
#   Expected sep (' ') but new line, EOF (or other non printing character) 
#   ends field 1 when detecting types (   first): "abcd efgh." ijkl.    mnop 
#   "qrst uvwx."    45

Filed 1367.

like image 56
Arun Avatar answered Oct 11 '22 03:10

Arun