I try to load a large datafile (some 20 million rows) using fread()
from the data.table package. Some rows are causing great trouble, however.
Minimal example:
text.csv contains:
id, text
1,"""Oops"",\""The"",""Georgia"""
fread("text.csv", sep=",")
Error in fread("text.csv", sep = ",") :
Not positioned correctly after testing format of header row. ch=','
In addition: Warning message:
In fread("text.csv", sep = ",") :
Starting data input on line 2 and discarding line 1 because it has too few or too many items to be column names or data: id, text
read.table()
works somewhat better but is too slow and too memory inefficient.
> read.table("text.csv", header = TRUE, sep=",")
id text
1 1 "Oops",\\"The","Georgia"
I realize that my text file is not properly formatted, but it is too large to edit in a practical manner.
Any help much appreciated.
EDIT:
A small sample of actual data records:
sample1.txt, a good record:
materiale_id,dk5,description,creator,subject-phrase,title,type
125030-katalog:000000003,[78.793],Privatoptagelse. - Liveoptagelse,Frederik Lundin,,Koncert i Copenhagen Jazz House den 26.1.1995,music
> fread("sample1.txt", sep=",")
materiale_id dk5 description creator subject-phrase
1: 125030-katalog:000000003 [78.793] Privatoptagelse. - Liveoptagelse Frederik Lundin NA
title type
1: Koncert i Copenhagen Jazz House den 26.1.1995 music
sample2.txt, a good and a bad record:
materiale_id,dk5,description,creator,subject-phrase,title,type
125030-katalog:000000003,[78.793],Privatoptagelse. - Liveoptagelse,Frederik Lundin,,Koncert i Copenhagen Jazz House den 26.1.1995,music
150012-leksikon:100019,,"Databehandling vedrører rutiner og procedurer for datarepræsentation, lagring af data, overførsel af data mellem forskellige instanser eller brugere af data, beregninger eller andre operationer udført med...",,"[""Informatik"",""it"",""It, teknik og naturvidenskab"",""leksikonartikel"",""Software, programmering, internet og webkommunikation""]",it - elementer i databehandling,article
> fread("sample2.txt", sep=",")
Empty data.table (0 rows) of 11 cols: 150012-leksikon:100019,V2,Databehandling vedrører rutiner og procedurer for datarepræsentation, lagring af data, overførsel af data mellem forskellige instanser eller brugere af data, beregninger eller andre operationer udført med...,V4,[""Informatik","it"...
EDIT 2:
Updating to R version 3.2.3 and data.table 1.9.6. helps on the above, but creates issues with other records:
sample3.txt, a good and a bad record:
materiale_id,dk5,description,creator,subject-phrase,title,type
125030-katalog:000236595,,,Red Tampa Solist prf,"[""Tom"",""Georgia"",""1929-1930""]","Georgia Tom, 1929-1930",music
125030-katalog:000236596,,,Jane Lucas (Solist),"[""1928-1931"",""Tom,\""The"",""Georgia"",""Accompanist""]","Georgia Tom,""The Accompanist"" (1928-1931)",music
> s3 <- fread("sample3.txt", sep=",")
Error in fread("sample3.txt", sep = ",") :
Expecting 7 cols, but line 3 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.
EDIT 3:
Updating to the development version 1.9.7 of data tables breaks fread()
altogether:
> s3 <- fread("sample3.txt", sep=",")
Error in fread("sample3.txt", sep = ",") :
showProgress is not type integer but type 'logical'. Please report.
EDIT 4:
It seems that the problem in my file occurs when records contain the string \\"
(litteraly, not regular expression). Apparently, there's one backslash too many, causing fread()
to misinterpret a double quote as the end of a string, where it should have been taken litteraly.
My best solutions so far is to do this:
m1 <- readLines("data.csv", encoding="UTF-8")
m2 <- gsub("\\\\\"", "\\\"", m1)
writeLines(m2, "data_new.csv", useBytes = TRUE)
m3 <- fread("data_new.csv", encoding="UTF-8", sep=",")
That seems to work.
I don't understand this 100% though, so any clarifications are more than welcome.
Single quotes need to be escaped by backslash in single-quoted strings, and double quotes in double-quoted strings. Alternative forms for the last two are '\u{nnnn}' and '\U{nnnnnnnn}'.
To check if the string has double quotes you can use: text_line. Contains("\""); Here \" will escape the double-quote.
Not a data.table solution, but you could try:
# read the file with 'readLines'
tmp <- readLines("trl.txt")
# create a column name vector of the first line
nms <- trimws(strsplit(tmp[1],',')[[1]])
# convert 'tmp' to a dataframe except the first line
tmp <- as.data.frame(tmp[-1])
# use 'separate' from 'tidyr' to split into two columns
library(tidyr)
df1 <- separate(tmp, "tmp[-1]", nms, sep=",", extra = "merge")
which gives:
> df1
id text
1 1 """Oops"",\\""The"",""Georgia"""
Update for edit 1: With the new example data fread
seems to be reading the data normally:
> s1 <- fread("sample1.txt", sep=",")
> s1
materiale_id dk5 description creator subject-phrase title type
1: 125030-katalog:000000003 [78.793] Privatoptagelse. - Liveoptagelse Frederik Lundin NA Koncert i Copenhagen Jazz House den 26.1.1995 music
> s2 <- fread("sample2.txt", sep=",")
> s2
materiale_id dk5
1: 125030-katalog:000000003 [78.793]
2: 150012-leksikon:100019
description
1: Privatoptagelse. - Liveoptagelse
2: Databehandling vedrører rutiner og procedurer for datarepræsentation, lagring af data, overførsel af data mellem forskellige instanser eller brugere af data, beregninger eller andre operationer udført med...
creator subject-phrase
1: Frederik Lundin
2: [""Informatik"",""it"",""It, teknik og naturvidenskab"",""leksikonartikel"",""Software, programmering, internet og webkommunikation""]
title type
1: Koncert i Copenhagen Jazz House den 26.1.1995 music
2: it - elementer i databehandling article
Update for edit 2 & 3:
When you look at the error-message:
Error in
fread("sample3.txt", sep = ",")
: Expecting 7 cols, but line 3 contains text after processing all cols. It is very likely that this is due to one or more fields having embeddedsep=','
and/or (unescaped) '\n' characters within unbalanced unescaped quotes.fread
cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in?fread
.
and then when you look at the second line of sample3.txt
you will see that the fourth column contains comma's as well. You can solve this in three steps:
1: Read the file with readLines
and replace the opening and closing character of the fourth column with another quote-character:
r3 <- readLines("sample3.txt")
r3 <- gsub('\"[',"'",r3,fixed=TRUE)
r3 <- gsub(']\"',"'",r3,fixed=TRUE)
2: Write it back to a text-file:
writeLines(r3, "sample3-1.txt")
3: Now you can read it with fread
(or read.table
/read.csv
). Because the number of column-titles is not the same as the number of columns, you will have to use header = FALSE
. Also explicitely set the quote-character to the new quote-character as inserted in step 2:
s3 <- fread("sample3-1.txt", quote = "\'", header = FALSE, skip = 1)
which gives:
> s3
V1 V2 V3 V4 V5 V6 V7 V8
1: 125030-katalog:000236595 NA NA Red Tampa Solist prf ""Tom"",""Georgia"",""1929-1930"" "Georgia Tom 1929-1930" music
2: 125030-katalog:000236596 NA NA Jane Lucas (Solist) ""1928-1931"",""Tom,\\""The"",""Georgia"",""Accompanist"" "Georgia Tom ""The Accompanist"" (1928-1931)" music
After that you can assign column names as follows:
names(s3) <- c("character","vector","with","eight","column","names")
NOTE: I used a pretty recent version (two weeks old) of v1.9.7 for this
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With