Removing "NUL" characters (within R)

Tags: string, r, nul

I've got a strange text file with a bunch of NUL characters in it (actually about 10 such files), and I'd like to programmatically replace them from within R. Here is a link to one of the files. With the aid of this question I've finally figured out a better-than-ad-hoc way of going into each file and find-and-replacing the nuisance characters. It turns out that each pair of them should correspond to one space ([NUL][NUL] -> [space]) to maintain the intended line width of the file (which is crucial for reading these as fixed-width further down the road).

However, for robustness' sake, I'd prefer a more automatable approach, ideally (for organization's sake) something I could add at the beginning of an R script I'm writing to clean up the files. This question looked promising, but the accepted answer is insufficient: readLines throws an error whenever I try to use it on these files (unless I activate skipNul).

Is there any way to get the lines of this file into R so I could use gsub or whatever else to fix this issue without resorting to external programs?

asked Dec 11 '15 01:12

MichaelChirico

1 Answer

Read the file as binary; then you can substitute the NULs at the byte level, e.g. to replace each of them with a space:

r = readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
r[r==as.raw(0)] = as.raw(0x20) ## replace with 0x20 = <space>
writeBin(r, "00staff.txt")
str(readLines("00staff.txt"))
#  chr [1:155432] "000540952Anderson            Shelley J       FW1949     2000R000000000000119460007620            3  0007000704002097907KGKG1616"| __truncated__ ...
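
To try the same steps without the original data, here is a minimal sketch on a small temporary file. The file contents and the variable names (tmp_in, tmp_out, n_nul) are made up for illustration and are not taken from 00staff.dat. Because every NUL becomes exactly one space, each line keeps the byte length it had in the raw file:

```r
# Stand-in input: two lines, each with a pair of NUL bytes in the middle
# (hypothetical data, not the real 00staff.dat)
tmp_in  <- tempfile(fileext = ".dat")
tmp_out <- tempfile(fileext = ".txt")
bytes <- c(charToRaw("AB"), as.raw(c(0, 0)), charToRaw("CD\n"),
           charToRaw("EF"), as.raw(c(0, 0)), charToRaw("GH\n"))
writeBin(bytes, tmp_in)

# Same steps as above: read raw bytes, swap NUL (0x00) for space (0x20)
r <- readBin(tmp_in, raw(), file.info(tmp_in)$size)
n_nul <- sum(r == as.raw(0))        # count the NUL bytes before replacing
r[r == as.raw(0)] <- as.raw(0x20)
writeBin(r, tmp_out)

lines <- readLines(tmp_out)
print(n_nul)
print(nchar(lines))                 # line widths after the one-for-one swap
```

Counting the NULs first is an optional sanity check; an odd count would suggest that not all of them come in pairs.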

R character strings cannot contain NULs, so rawToChar fails on the raw data as-is. You can instead substitute the NULs with a byte that does not otherwise occur in the file (such as "\01") and then work on the resulting string. For example, to replace each pair of NULs with one space:

r = readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
r[r==as.raw(0)] = as.raw(1)
a = gsub("\01\01", " ", rawToChar(r), fixed=TRUE)
s = strsplit(a, "\n", TRUE)[[1]]
str(s)
# chr [1:155432] "000540952Anderson            Shelley J       FW1949     2000R000000000000119460007620            3  0007000704002097907KGKG1616"| __truncated__
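
Since the question mentions about ten such files, the steps above could be wrapped in a small function and applied to each path. This is only a sketch: read_nul_pairs and its sentinel argument are hypothetical names, the demo files are temporary stand-ins for the real data, and the check on the sentinel byte is an added precaution rather than part of the original answer.

```r
# Hypothetical helper: read a file as raw bytes, turn each pair of NULs into
# one space, and return the lines as a character vector.
read_nul_pairs <- function(path, sentinel = as.raw(1)) {
  r <- readBin(path, raw(), file.info(path)$size)
  # If the sentinel byte already occurs, real data could be rewritten too
  if (any(r == sentinel)) stop("sentinel byte already present in ", path)
  r[r == as.raw(0)] <- sentinel
  pair <- rawToChar(c(sentinel, sentinel))   # two sentinel bytes in a row
  a <- gsub(pair, " ", rawToChar(r), fixed = TRUE)
  strsplit(a, "\n", fixed = TRUE)[[1]]
}

# Stand-ins for the real files: two temporary files with one NUL pair each
demo_files <- c(tempfile(), tempfile())
for (f in demo_files) {
  writeBin(c(charToRaw("AB"), as.raw(c(0, 0)), charToRaw("CD\nEF\n")), f)
}

cleaned <- lapply(demo_files, read_nul_pairs)
str(cleaned)
```

The returned lines could then be written back out with writeLines, or parsed as fixed-width data through a textConnection.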

answered Sep 23 '22 17:09

Simon Urbanek
