Following on from my query last week reading badly formed csv in R - mismatched quotes, these same CSV files also have embedded control characters such as the ASCII Substitute Character which is decimal 26 or 0x1A. Unfortunately readLines()
seems to truncate the line at this character, so I am having difficulty in matching quotes - apart from losing the later fields in these lines!
I have tried to readBin()
but I can't get it to read this file. I'm afraid I can't cleanly read this into R to give you an example and I'm having difficulty in creating these in R. Sorry not to be able to demonstrate with a clean example. Thoughts?
Update
Now I'm confused - when I use the code
h3 <- paste('1,34,44.4,"', rawToChar(as.raw(c(as.integer(k1), 26, 65))), '",99')
identical(readLines(textConnection(h3)), h3)
I get TRUE
which I find quite surprising!
Update 2
h3
[1] "1,34,44.4,\" HIJK\032A \",99"
> writeLines(h3, 'h3.txt')
> h3a <- readLines('h3.txt')
Warning message:
In readLines("h3.txt") : incomplete final line found on 'h3.txt'
> h3a
[1] "1,34,44.4,\" HIJK"
So readLines() reacts differently when coming from a textConnection()
and it silently truncates at the SUB character.
I would be surprised if it makes a difference but I'm on 2.15.2 on Windows-64.
Update 3
Some vague success in solving this...
zb <- file('h3.txt', "rb")
tmp <- readBin(zb, raw(), size=1, n=400) # raw is always of size =1
nchar(tmp)
# [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
close(zb)
tmp
# [1] 31 2c 33 34 2c 34 34 2e 34 2c 22 20 48 49 4a 4b 1a 41 20 22 2c 39 39 0d 0a
rawToChar(tmp)
# [1] "1,34,44.4,\" HIJK\032A \",99\r\n"
i.e. if I read in the file as binary and convert to character() afterwards it seems to work... this will be tedious for large CSV files...
Could there be a bug in R in incorrectly detecting a Control-Z as end of file on windows??
Reading Text (*.txt) files in R is easy and simple enough. If you have data in a *.txt file or a tab-delimited text file, you can easily import it with read.table ( ) function. Suppose we have a data file named "Hald.txt" stored at path "D:STATSTA-654Hald.txt". The following code line can be used for reading text (*.txt) files in R:
Now, we can use the read_file function of the readr package to import our TXT file as character string: That worked well! If you need further information on the R programming code of this article, you might watch the following video of my YouTube channel.
Some of these text files end with a SUB character (a substitute character. It may be 0x1A.) How do I detect this character and remove it from the text file using C#?
Ctrl+Z, is traditionally often described as ^Z). Unicode encodes this character either, but recommends to use the replacement character (, U+FFFD) instead of representing un-decodable inputs, in cases when the output encoding is compatible with it.
I also ran into this problem when I used read.csv with a csv file that contained the SUB or CTRL-Z in the middle of the file.
Solved it with the readr package (if your file is comma separated)
library(readr)
read_csv("h3.txt")
If you have a ; as a separator, then use:
library(readr)
read_csv2("h3.txt")
I think I've figured out a solution - because there appears to be a problem reading a Control-Z in the middle of a file on Windows, we need to read the file in binary / raw mode.
fnam <- 'h3.txt'
tmp.bin <- readBin(fnam, raw(), size=1, n=max(2*file.info(dfnam)$size, 100))=1
tmp.char <- rawToChar(tmp.bin)
txt <- unlist(strsplit(tmp.char, '\r\n', fixed=TRUE))
txt
[1] "1,34,44.4,\" HIJK\032A \",99"
Update The following better answer was posted by Duncan Murdoch to R-Devel refer. Converting it into a function I get:
sReadLines <- function(fnam) {
f <- file(fnam, "rb")
res <- readLines(f)
close(f)
res
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With