Reading in a text file with a SUB (0x1A, Control-Z) character in R on Windows

Tags: windows, parsing, r

Following on from my query last week (reading badly formed csv in R - mismatched quotes), these same CSV files also contain embedded control characters such as the ASCII substitute character (SUB), which is decimal 26 or 0x1A. Unfortunately readLines() seems to truncate the line at this character, so I cannot match up the quotes, and I also lose the later fields on those lines.

I have tried readBin(), but I can't get it to read this file. I can't cleanly read the file into R to give you an example, and I'm having difficulty creating such a file in R, so apologies for not having a clean reproducible example. Thoughts?
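
One way to build such a file for a reproducible example is to write the bytes directly with writeBin(), which avoids any text-mode translation. This is only a sketch: the file name comes from tempfile() and the sample record is made up for illustration.

```r
# Sample record with a SUB (0x1A) byte inside the quoted field, ending in CRLF.
bytes <- c(charToRaw('1,34,44.4," HIJK'),
           as.raw(26),
           charToRaw('A ",99\r\n'))

sub_file <- tempfile(fileext = ".txt")  # hypothetical scratch file
writeBin(bytes, sub_file)               # raw vectors are written byte for byte

n_bytes <- file.info(sub_file)$size     # size on disk, in bytes
```

Because the record is assembled as a raw vector, the SUB byte reaches the file without passing through a text-mode connection.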

Update

Now I'm confused - when I use the code

 k1 <- 72:75  # ASCII codes for "HIJK" (inferred from the output shown in Update 2)
 h3 <- paste('1,34,44.4,"', rawToChar(as.raw(c(as.integer(k1), 26, 65))), '",99')
 identical(readLines(textConnection(h3)), h3)

I get TRUE, which I find quite surprising!

Update 2

> h3
[1] "1,34,44.4,\" HIJK\032A \",99"
> writeLines(h3, 'h3.txt')
> h3a <- readLines('h3.txt')
Warning message:
In readLines("h3.txt") : incomplete final line found on 'h3.txt'
> h3a
[1] "1,34,44.4,\" HIJK"

So readLines() behaves differently depending on the source: reading from a textConnection() preserves the SUB character, whereas reading from the file silently truncates the line at it.

I would be surprised if it makes a difference, but I'm on R 2.15.2 on 64-bit Windows.

Update 3

Some vague success in solving this...

zb <- file('h3.txt', "rb")
tmp <- readBin(zb, raw(), size=1, n=400) # raw is always of size =1
nchar(tmp)
# [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
close(zb)
tmp
# [1] 31 2c 33 34 2c 34 34 2e 34 2c 22 20 48 49 4a 4b 1a 41 20 22 2c 39 39 0d 0a
rawToChar(tmp)
# [1] "1,34,44.4,\" HIJK\032A \",99\r\n"

That is, if I read the file in as binary and convert to character afterwards, it seems to work, but this will be tedious for large CSV files.

Could there be a bug in R that incorrectly treats a Control-Z as end of file on Windows?

Sean asked Apr 08 '13 08:04

2 Answers

I also ran into this problem when I used read.csv with a CSV file that contained the SUB (Ctrl-Z) character in the middle of the file.

I solved it with the readr package. If your file is comma-separated, use:

library(readr)
read_csv("h3.txt")

If you have a ; as a separator, then use:

library(readr)
read_csv2("h3.txt")
Sander van den Oord answered Nov 13 '22 21:11


I think I've figured out a solution. Because there appears to be a problem reading a Control-Z in the middle of a file on Windows, we need to read the file in binary (raw) mode.

fnam <- 'h3.txt'
tmp.bin <- readBin(fnam, raw(), size=1, n=max(2*file.info(fnam)$size, 100))
tmp.char <- rawToChar(tmp.bin)
txt <- unlist(strsplit(tmp.char, '\r\n', fixed=TRUE))
txt

[1] "1,34,44.4,\" HIJK\032A \",99"
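
If the SUB bytes themselves are not wanted in the data, a variation on the approach above is to drop them from the raw vector before converting to text and parsing. This is a minimal sketch, not part of the original answer: the scratch file, the sample record, and the choice to discard rather than replace the byte are all assumptions made for illustration.

```r
# Build a small sample file containing a SUB (0x1A) byte (illustrative data).
fnam <- tempfile(fileext = ".txt")
writeBin(c(charToRaw('1,34,44.4," HIJK'), as.raw(26),
           charToRaw('A ",99\r\n')), fnam)

# Read every byte of the file, then filter out the SUB bytes.
raw.bytes <- readBin(fnam, what = "raw", n = file.info(fnam)$size)
clean.bytes <- raw.bytes[raw.bytes != as.raw(26)]

# Parse the cleaned text as CSV; the sample has no header row.
df <- read.csv(text = rawToChar(clean.bytes), header = FALSE,
               stringsAsFactors = FALSE)
```

Filtering at the byte level keeps the quoting intact, so the quoted field still parses as a single column.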

Update: A better answer was posted by Duncan Murdoch to R-Devel. Converting it into a function, I get:

sReadLines <- function(fnam) {
    # Open in binary mode so that Control-Z is not taken as end of file
    f <- file(fnam, "rb")
    res <- readLines(f)
    close(f)
    res
}
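
A quick way to check the function is to write a sample file containing a SUB byte and confirm that the whole line comes back. The sketch below repeats the function so that it runs on its own; the scratch file and the sample record are assumptions made for illustration.

```r
sReadLines <- function(fnam) {
    f <- file(fnam, "rb")   # binary mode: no text-mode end-of-file handling
    res <- readLines(f)
    close(f)
    res
}

# Sample file with a SUB (0x1A) byte inside the quoted field, ending in CRLF.
fnam <- tempfile(fileext = ".txt")
writeBin(c(charToRaw('1,34,44.4," HIJK'), as.raw(26),
           charToRaw('A ",99\r\n')), fnam)

txt <- sReadLines(fnam)

# TRUE if the SUB character survived in the line that was read.
has.sub <- grepl(rawToChar(as.raw(26)), txt, fixed = TRUE)
```

readLines() accepts LF, CRLF, or CR line endings, so the trailing "\r\n" is stripped even though the connection is opened in binary mode.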
Sean answered Nov 13 '22 20:11