Removing "NUL" characters (within R)

Tags: string, r, nul

I've got a strange text file with a bunch of NUL characters in it (actually about 10 such files), and I'd like to replace them programmatically from within R. Here is a link to one of the files. With the aid of this question I've finally figured out a better-than-ad-hoc way of going into each file and find-and-replacing the nuisance characters. It turns out that each pair of them should correspond to one space ([NUL][NUL] -> [space]) to maintain the intended line width of the file (which is crucial for reading these as fixed-width files further down the road).

However, for robustness' sake, I'd prefer a more automatable solution, ideally (for organization's sake) something I could add at the beginning of an R script I'm writing to clean up the files. This question looked promising, but the accepted answer is insufficient: readLines throws an error whenever I try to use it on these files (unless I activate skipNul).

Is there any way to get the lines of this file into R so I could use gsub or whatever else to fix this issue without resorting to external programs?

MichaelChirico asked Dec 11 '15 01:12



1 Answer

Read the file as binary; you can then substitute the NULs, e.g. to replace them with spaces:

# Read the whole file as raw bytes
r = readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
r[r == as.raw(0)] = as.raw(0x20)  ## replace NUL with 0x20 = <space>
writeBin(r, "00staff.txt")
# The cleaned file can now be read as ordinary text
str(readLines("00staff.txt"))
#  chr [1:155432] "000540952Anderson            Shelley J       FW1949     2000R000000000000119460007620            3  0007000704002097907KGKG1616"| __truncated__ ...
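Since the question mentions about ten such files, the steps above could be wrapped in a small function and applied to each file in turn. The following is only a sketch: the name `nul_to_space` and the tiny temporary demo file are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper (name is an assumption): replace every NUL byte in
# `infile` with a space and write the result to `outfile`.
nul_to_space <- function(infile, outfile) {
  bytes <- readBin(infile, what = raw(), n = file.info(infile)$size)
  bytes[bytes == as.raw(0)] <- as.raw(0x20)  # NUL -> space
  writeBin(bytes, outfile)
  invisible(outfile)
}

# Self-contained demonstration on a small temporary file (invented data):
# the bytes "ab", two NULs, "cd", newline.
demo_in  <- tempfile(fileext = ".dat")
demo_out <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("ab"), as.raw(c(0, 0)), charToRaw("cd\n")), demo_in)

nul_to_space(demo_in, demo_out)
demo_lines <- readLines(demo_out)
```

For a set of real files, something like `for (f in files) nul_to_space(f, sub("\\.dat$", ".txt", f))` would process them all, assuming `files` holds their paths. Note that this variant maps every NUL to its own space, so a pair of NULs becomes two spaces.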

You could also replace the NULs with a character that is very unlikely to occur in the file (such as "\01") and then work on the resulting string directly in R, e.g. to replace each pair of NULs ("\00\00") with one space:

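r = readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
r[r == as.raw(0)] = as.raw(1)  ## NUL -> 0x01 placeholder, so rawToChar() accepts the data
a = gsub("\01\01", " ", rawToChar(r), fixed=TRUE)  ## each placeholder pair -> one space
s = strsplit(a, "\n", TRUE)[[1]]  ## split into lines
str(s)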
# chr [1:155432] "000540952Anderson            Shelley J       FW1949     2000R000000000000119460007620            3  0007000704002097907KGKG1616"| __truncated__
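The placeholder approach relies on the placeholder byte not already being present in the file. A minimal sketch that checks this assumption before substituting might look as follows; the function name `nul_pairs_to_space`, its `sentinel` argument, and the demo data are hypothetical and not part of the original answer.

```r
# Hypothetical helper (name and arguments are assumptions): read `infile`,
# turn each pair of NUL bytes into one space, and return the lines.
nul_pairs_to_space <- function(infile, sentinel = as.raw(1)) {
  bytes <- readBin(infile, what = raw(), n = file.info(infile)$size)
  # Guard: the placeholder byte must not already occur in the data.
  if (any(bytes == sentinel)) {
    stop("sentinel byte already occurs in ", infile)
  }
  bytes[bytes == as.raw(0)] <- sentinel          # NUL -> placeholder
  pair <- rawToChar(c(sentinel, sentinel))       # two placeholders in a row
  txt  <- gsub(pair, " ", rawToChar(bytes), fixed = TRUE)
  # Any unpaired NUL would remain in the text as a single placeholder.
  strsplit(txt, "\n", fixed = TRUE)[[1]]
}

# Self-contained demonstration on invented data:
# line 1 has one NUL pair, line 2 has two NUL pairs.
demo_file <- tempfile(fileext = ".dat")
writeBin(c(charToRaw("ab"), as.raw(c(0, 0)), charToRaw("cd\n"),
           charToRaw("ef"), as.raw(c(0, 0, 0, 0)), charToRaw("gh\n")),
         demo_file)
demo_result <- nul_pairs_to_space(demo_file)
```

The returned character vector can be passed on to `writeLines()` or parsed as fixed-width text without an intermediate file.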
Simon Urbanek answered Sep 23 '22 17:09