How to write and read printable ASCII characters to/from a UTF-8 encoded file?

Tags: file-io, r, ascii, utf-8, file-encodings

I want to write a UTF-8 encoded file containing the character 10001100, which is Œ (the Latin capital ligature OE) in the extended ASCII table:

zz <- file("c:/testbin", "wb")
writeBin("10001100",zz)
close(zz)

When I open the file in Office (encoding = UTF-8), I can see Œ. What I cannot do is read it back with readBin:

zz <- file("c:/testbin", "rb")
readBin(zz,raw())->x
x
[1] c5
readBin(zz,character())->x
Warning message:
In readBin(zz, character()) :
incomplete string at end of file has been discarded
x
character(0)

asked Dec 11 '13 08:12

showkey

2 Answers

There are multiple difficulties here.

Firstly, there are actually several "Extended ASCII" tables. Since you are on Windows you are probably using CP1252, which is one of them; it is also called Windows-1252 or "ANSI", and it is the default "Latin" encoding on Windows. However, the code for Œ varies within this family of tables. In CP1252, "Œ" is represented by 10001100, i.e. "\x8c", as you wrote. It does not exist in ISO-8859-1, however. And in UTF-8 it is encoded as the two bytes "\xc5\x92", for the code point "\u0152", as rlegendi indicated.
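As a quick check of the byte values quoted above, here is a minimal sketch that asks iconv() for the bytes of Œ in each encoding. It assumes your platform's iconv recognises the names "CP1252" and "ISO-8859-1" (most do); the variable names are only illustrative.

```r
oe <- "\u0152"                      # Œ, stored by R as a UTF-8 string

utf8_bytes <- charToRaw(oe)         # bytes of the UTF-8 encoding
utf8_bytes
# [1] c5 92

# toRaw = TRUE returns a list with one raw vector per input string
cp1252_bytes <- iconv(oe, from = "UTF-8", to = "CP1252", toRaw = TRUE)[[1]]
cp1252_bytes
# [1] 8c

# ISO-8859-1 has no Œ. iconv() is documented to return NA for strings it
# cannot convert, though the exact behaviour can depend on the platform.
latin1 <- iconv(oe, from = "UTF-8", to = "ISO-8859-1")
```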

So, to write UTF-8 starting from a CP1252 byte given as a binary string, you have to convert your string into a number and then into a "raw" value (the R class for bytes), turn that into a character, and convert it from CP1252 to UTF-8 with iconv() (which in fact replaces the byte value with the corresponding bytes for the same character in UTF-8). After that you can convert it back to raw, and finally write it to the file:

char_bin_str <- '10001100'
char_u <- iconv(rawToChar(as.raw(strtoi(char_bin_str, base=2))),
              # "\x8c"    8c     140    '10001100'
                from="CP1252",
                to="UTF-8")

test.file <- "~/test-unicode-bytes.txt"

zz <- file(test.file, 'wb')
writeBin(charToRaw(char_u), zz)
close(zz)
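To see what each link in the nested call above produces, the same chain can be unrolled one step at a time. This is only a sketch for inspection; the intermediate variable names are made up here, and it again assumes iconv knows "CP1252".

```r
char_bin_str <- "10001100"

code  <- strtoi(char_bin_str, base = 2)            # binary string -> integer
byte  <- as.raw(code)                              # integer -> single raw byte
ch    <- rawToChar(byte)                           # raw byte -> one-byte string "\x8c"
ch_u  <- iconv(ch, from = "CP1252", to = "UTF-8")  # same character, UTF-8 bytes
bytes <- charToRaw(ch_u)                           # string -> raw bytes to write

code
# [1] 140
byte
# [1] 8c
bytes
# [1] c5 92
```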

Secondly, when you readBin(), do not forget to give a number of bytes to read which is big enough (n=file.info(test.file)$size here), otherwise it reads only the first byte (see below):

zz <- file(test.file, 'rb')
x <- readBin(zz, 'raw', n=file.info(test.file)$size)
close(zz)

x
[1] c5 92

Thirdly, if in the end you want to turn it back into a character that R understands and displays correctly, you first have to convert it into a string with rawToChar(). How it is displayed then depends on your default encoding; use Sys.getlocale() to see what it is (probably something ending with 1252 on Windows). The best approach is probably to declare that your string should be read as UTF-8; otherwise it will be interpreted in your default encoding.

xx <- rawToChar(x)
Encoding(xx) <- "UTF-8"

xx
[1] "Œ"

This should keep things under control, write the correct bytes in UTF-8, and be the same on every OS. Hope it helps.
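For convenience, the write and read steps can be wrapped into a small round trip. The following is a sketch under a few assumptions: the helper names write_utf8_bytes() and read_utf8_string() are hypothetical (they are not part of base R), and a temporary file stands in for the real path.

```r
# Hypothetical helper: write the UTF-8 bytes of a string, with no terminator
write_utf8_bytes <- function(s, path) {
  con <- file(path, "wb")
  on.exit(close(con))
  writeBin(charToRaw(enc2utf8(s)), con)
  invisible(path)
}

# Hypothetical helper: read the whole file and mark the result as UTF-8
read_utf8_string <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  bytes <- readBin(con, "raw", n = file.info(path)$size)
  s <- rawToChar(bytes)
  Encoding(s) <- "UTF-8"
  s
}

tmp <- tempfile(fileext = ".txt")   # stand-in for the real file path
write_utf8_bytes("\u0152", tmp)
back <- read_utf8_string(tmp)

identical(back, "\u0152")           # same string as the one written
utf8ToInt(back) == 0x152            # code point U+0152
```

Checking the code point with utf8ToInt() avoids relying on how the console displays the character, which depends on the locale.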

PS: I am not exactly sure why x returned c5 with your code; I guess it would have returned c5 92 if you had passed n=2 (or more) to readBin(). On my machines (Mac OS X 10.7 with R 3.0.2, and Win XP with R 2.15), your code returns 31, the hex ASCII code of '1' (the first character of '10001100', which makes sense). Maybe you opened your file in Office as CP1252 and saved it as UTF-8 there before coming back to R?
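To illustrate the point about 31, here is a small sketch of what the code from the question writes and reads back, using a temporary file instead of c:/testbin. writeBin() given a character vector writes the characters of the string followed by a zero terminator, and readBin() reads a single element unless n is set.

```r
tmp <- tempfile()                # stand-in for "c:/testbin"

zz <- file(tmp, "wb")
writeBin("10001100", zz)         # a character string: written as text plus a nul byte
close(zz)

zz <- file(tmp, "rb")
first <- readBin(zz, raw())      # n defaults to 1, so only one byte is read
close(zz)
first
# [1] 31

all_bytes <- readBin(tmp, "raw", n = file.info(tmp)$size)
all_bytes
# [1] 31 30 30 30 31 31 30 30 00
```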

answered Nov 11 '22 08:11

AlxH

Try this instead (I replaced the binary value with the Unicode escape, because I think that is the better choice when you want this output):

writeBin(charToRaw("\u0152"), zz)
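The one-liner assumes zz is already an open binary connection. A self-contained sketch, with a temporary file standing in for the real path (out_file is just an illustrative name), might look like this:

```r
out_file <- tempfile()            # stand-in for the real file path

zz <- file(out_file, "wb")
writeBin(charToRaw("\u0152"), zz) # writes the two UTF-8 bytes of Œ
close(zz)

written <- readBin(out_file, "raw", n = file.info(out_file)$size)
written
# [1] c5 92
```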

answered Nov 11 '22 07:11

rlegendi
