reading in a text file with a SUB (1a) (Control-Z) character in R on Windows

Tags:

windows

parsing

r

Following on from my query last week, "reading badly formed csv in R - mismatched quotes": these same CSV files also have embedded control characters such as the ASCII substitute character (SUB), which is decimal 26 or 0x1A. Unfortunately, readLines() seems to truncate the line at this character, so besides losing the later fields in these lines, I am having difficulty matching quotes.

I have tried readBin(), but I can't get it to read this file. I can't cleanly read the file into R to give you an example, and I'm having difficulty creating such a file in R, so apologies for not demonstrating with a clean example. Thoughts?

Update

Now I'm confused - when I use the code

 k1 <- 72:75  # assumed: ASCII codes for "HIJK", inferred from the value of h3 printed in Update 2
 h3 <- paste('1,34,44.4,"', rawToChar(as.raw(c(as.integer(k1), 26, 65))), '",99')
 identical(readLines(textConnection(h3)), h3)

I get TRUE which I find quite surprising!

Update 2

> h3
[1] "1,34,44.4,\" HIJK\032A \",99"
> writeLines(h3, 'h3.txt')
> h3a <- readLines('h3.txt')
Warning message:
In readLines("h3.txt") : incomplete final line found on 'h3.txt'
> h3a
[1] "1,34,44.4,\" HIJK"

So readLines() behaves differently when reading from a file than when reading from a textConnection(): on the file, it silently truncates the line at the SUB character.

I would be surprised if it makes a difference, but I'm on R 2.15.2 on 64-bit Windows.

Update 3

Some vague success in solving this...

zb <- file('h3.txt', "rb")
tmp <- readBin(zb, raw(), size=1, n=400) # raw is always of size =1
nchar(tmp)
# [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
close(zb)
tmp
# [1] 31 2c 33 34 2c 34 34 2e 34 2c 22 20 48 49 4a 4b 1a 41 20 22 2c 39 39 0d 0a
rawToChar(tmp)
# [1] "1,34,44.4,\" HIJK\032A \",99\r\n"

That is, if I read the file in as binary and convert it to character afterwards, it seems to work, although this will be tedious for large CSV files.

Could there be a bug in R that incorrectly treats a Control-Z as end of file on Windows?
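For anyone wanting to reproduce this, a test file can be built byte for byte with writeBin(), which avoids any text-mode translation when writing. The following is only a minimal sketch: the temporary path is an assumption made so that no real file is overwritten, and the bytes are the 25 shown in the readBin() output of Update 3.

```r
# Build the 25 bytes of the sample line, including SUB (0x1A) and a CRLF ending.
bytes <- c(charToRaw('1,34,44.4," HIJK'),
           as.raw(0x1a),
           charToRaw('A ",99\r\n'))

# tempfile() is used only so the sketch does not overwrite an existing file.
path <- tempfile(fileext = ".txt")
writeBin(bytes, path)

# Read the bytes back in binary mode and check that nothing was lost.
back <- readBin(path, what = "raw", n = file.info(path)$size)
identical(back, bytes)
```

Because both the write and the read go through binary mode, the comparison does not depend on how the platform treats Control-Z in text mode.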


asked Apr 08 '13 08:04

Sean

2 Answers

I also ran into this problem when I used read.csv() on a CSV file that contained a SUB (Ctrl-Z) character in the middle of the file.

I solved it with the readr package (if your file is comma separated):

library(readr)
read_csv("h3.txt")

If you have a ; as a separator, then use:

library(readr)
read_csv2("h3.txt")

answered Nov 13 '22 21:11

Sander van den Oord

I think I've figured out a solution: because there appears to be a problem reading a Control-Z in the middle of a file on Windows, we need to read the file in binary (raw) mode.

fnam <- 'h3.txt'
tmp.bin <- readBin(fnam, raw(), size=1, n=max(2*file.info(fnam)$size, 100))
tmp.char <- rawToChar(tmp.bin)
txt <- unlist(strsplit(tmp.char, '\r\n', fixed=TRUE))
txt

[1] "1,34,44.4,\" HIJK\032A \",99"
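The readBin() approach above could be wrapped in a small helper that also copes with files using bare LF line endings. This is a sketch only: read_lines_raw is a hypothetical name, the demo file is a made-up two-line example, and rawToChar() will fail if the file contains embedded NUL bytes.

```r
# Hypothetical helper: read the whole file as raw bytes, then split into lines.
read_lines_raw <- function(path) {
  n_bytes <- file.info(path)$size
  bytes <- readBin(path, what = "raw", n = n_bytes)
  text <- rawToChar(bytes)          # errors on embedded NUL bytes
  # Split on CRLF or LF; the SUB character stays inside its line.
  strsplit(text, "\r?\n")[[1]]
}

# Demo file (assumed content): one CRLF-terminated line containing SUB,
# followed by one LF-terminated line.
demo <- tempfile(fileext = ".txt")
writeBin(c(charToRaw('1,"HI'), as.raw(0x1a),
           charToRaw('JK",99\r\n2,"LM",100\n')), demo)

lines <- read_lines_raw(demo)
length(lines)
```

The whole file is held in memory twice here (once as raw bytes, once as a string), so this is mainly suited to files of moderate size.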

Update: the following better answer was posted by Duncan Murdoch to R-devel. Converting it into a function, I get:

sReadLines <- function(fnam) {
    # Open in binary mode so that Control-Z is not treated as end of file
    f <- file(fnam, "rb")
    res <- readLines(f)
    close(f)
    res
}
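For large CSV files, the same binary-mode connection can be read a chunk of lines at a time rather than all at once. The following is a rough sketch, not part of the R-devel answer: process_in_chunks, its handle callback and the chunk_size default are all hypothetical names and values chosen for illustration.

```r
# Hypothetical helper: pass successive chunks of lines to `handle`.
process_in_chunks <- function(path, handle, chunk_size = 10000L) {
  con <- file(path, open = "rb")   # binary mode, as in sReadLines() above
  on.exit(close(con))              # close the connection even on error
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break # nothing left to read
    handle(lines)
  }
  invisible(NULL)
}

# Demo file (assumed content): five CRLF-terminated lines, each containing SUB.
path <- tempfile(fileext = ".csv")
one_line <- c(charToRaw('1,"HI'), as.raw(0x1a), charToRaw('JK",99\r\n'))
writeBin(rep(one_line, 5), path)

# Count the lines, two at a time.
total <- 0L
process_in_chunks(path, function(lines) total <<- total + length(lines),
                  chunk_size = 2L)
total
```

In practice the callback would do the real work, for example handing each chunk to read.csv(text = lines, header = FALSE); that step is left out here to keep the sketch small.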


answered Nov 13 '22 20:11

Sean
