I have hundreds of large CSV files (ranging from 10k to 100k lines each), and some of them have badly formed descriptions with quotes within quotes, so they might look something like this:
ID,Description,x
3434,"abc"def",988
2344,"fred",3484
2345,"fr""ed",3485
2346,"joe,fred",3486
I need to be able to cleanly parse all of these lines in R as CSV. dput()'ing it and reading ...
txt <- c("ID,Description,x",
"3434,\"abc\"def\",988",
"2344,\"fred\",3484",
"2345,\"fr\"\"ed\",3485",
"2346,\"joe,fred\",3486")
read.csv(text=txt[1:4], colClasses='character')
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on 'text'
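A quick way to see what is going wrong here is to count the quote characters per line (a small diagnostic, not part of the original post): the stray quote in "abc"def" leaves an odd number of quotes in the text, so read.csv thinks a quoted field is still open when it reaches the end.
nchar(gsub('[^"]', '', txt))
# should report 0 3 2 4 2 - line 2 makes the total odd, so a quote is left open at the end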
If we change the quoting and do not include the last line with the embedded comma, it works well:
read.csv(text=txt[1:4], colClasses='character', quote='')
However, if we change the quoting and include the last line with the embedded comma...
read.csv(text=txt[1:5], colClasses='character', quote='')
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 4 elements
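With quoting disabled, the line with the embedded comma splits into four fields while the header and the other lines have three; read.table takes its column count from the first few lines, so it expects four columns and then complains about the three-field lines. count.fields makes the mismatch visible (a diagnostic only, not part of the original post):
count.fields(textConnection(txt), sep = ",", quote = "")
# should report 3 3 3 3 4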
EDIT x2: I should have said that, unfortunately, some of the descriptions have commas in them; the code above has been edited accordingly.
Change the quote setting:
read.csv(text=txt, colClasses='character',quote = "")
ID Description x
1 3434 "abc"def" 988
2 2344 "fred" 3484
3 2345 "fr""ed" 3485
4 2346 "joe" 3486
txt <- c("ID,Description,x",
"3434,\"abc\"def\",988",
"2344,\"fred\",3484",
"2345,\"fr\"\"ed\",3485",
"2346,\"joe,fred\",3486")
txt2 <- readLines(textConnection(txt))
txt2 <- strsplit(txt2,",")
txt2 <- lapply(txt2,function(x) c(x[1],paste(x[2:(length(x)-1)],collapse=","),x[length(x)]) )
m <- do.call("rbind",txt2)
df <- as.data.frame(m,stringsAsFactors = FALSE)
names(df) <- df[1,]
df <- df[-1,]
# ID Description x
# 2 3434 "abc"def" 988
# 3 2344 "fred" 3484
# 4 2345 "fr""ed" 3485
# 5 2346 "joe,fred" 3486
No idea if that is sufficiently efficient for your use case.
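A regex-based variant of the same idea, splitting each line at the first and last comma, might avoid some of the per-line work; this is a sketch, not part of the original answer, and it assumes only the middle field can contain commas:
groups <- regmatches(txt, regexec("^([^,]*),(.*),([^,]*)$", txt))
mat <- do.call(rbind, lapply(groups, function(g) g[-1]))  # drop the full-match element
df <- as.data.frame(mat[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(df) <- mat[1, ]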
As there is only one quoted column in this set of nasty files, I can do a read.csv() on each side to handle the unquoted columns to the left and right of the quoted column. So my current solution, based on the info from both @agstudy and @roland, is:
csv.parser <- function(txt) {
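  # split each line only at ," or ", so commas inside the quoted Description field stay in one piece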
  df <- do.call("rbind", regmatches(txt, gregexpr(',"|",', txt), invert = TRUE))
  # remove the header
  df <- df[-1, ]
  # parse the left csv
  df1 <- read.csv(text = df[, 1], colClasses = "character", comment.char = "", header = FALSE)
  # parse the right csv
  df3 <- read.csv(text = df[, 3], colClasses = "character", comment.char = "", header = FALSE)
  # put them back together
  dfa <- cbind(df1, df[, 2], df3)
  # put the header back in
  names(dfa) <- names(read.csv(text = txt[1], header = TRUE))
  dfa
}
# debug(csv.parser)
csv.parser(txt)
Running this on a wider set of data works, thankfully:
txt <- c("ID,Description,x,y",
"3434,\"abc\"def\",988,344",
"2344,\"fred\",3484,3434",
"2345,\"fr\"\"ed\",3485,7347",
"2346,\"joe,fred\",3486,484")
csv.parser(txt)
ID Description x y
1 3434 abc"def 988 344
2 2344 fred 3484 3434
3 2345 fr""ed 3485 7347
4 2346 joe,fred 3486 484
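Since there are hundreds of these files, the parser can be mapped over all of them; a rough sketch (the directory path and file pattern are placeholders, not from the original post):
files <- list.files("path/to/nasty/csvs", pattern = "\\.csv$", full.names = TRUE)
parsed <- lapply(files, function(f) csv.parser(readLines(f)))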