I want to write a UTF-8 encoded file containing the character 10001100, which is Œ, the Latin capital ligature OE in the extended ASCII table:
zz <- file("c:/testbin", "wb")
writeBin("10001100",zz)
close(zz)
When I open the file in Office (encoding=utf-8), I can see Œ. What I cannot do is read it back with readBin:
zz <- file("c:/testbin", "rb")
readBin(zz,raw())->x
x
[1] c5
readBin(zz,character())->x
Warning message:
In readBin(zz, character()) :
incomplete string at end of file has been discarded
x
character(0)
The first 128 characters in the Unicode library match those in the ASCII library, and UTF-8 translates these 128 Unicode characters into the same binary strings as ASCII. As a result, UTF-8 can take a text file formatted by ASCII and convert it to human-readable text without issue.
ASCII is a code that uses numbers to represent characters. Each character is assigned a number between 0 and 127. Upper and lower case characters are assigned different numbers. For example, the character A is assigned the decimal number 65, while a is assigned decimal 97.
For characters represented by the 7-bit ASCII character codes, the UTF-8 representation is exactly equivalent to ASCII, allowing transparent round-trip migration. Other Unicode characters are represented in UTF-8 by sequences of up to 4 bytes (the original design allowed up to 6, but RFC 3629 restricts UTF-8 to 4), though most Western European characters require only 2 bytes.
UTF-8 is backward-compatible with ASCII and can represent any standard Unicode character. The first 128 UTF-8 characters precisely match the first 128 ASCII characters (numbered 0-127), meaning that existing ASCII text is already valid UTF-8. All other characters use two to four bytes.
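The byte counts described above can be checked directly in R by looking at the raw bytes of a few characters. This is only an illustrative sketch; the variable names are made up for the example.

```r
# ASCII characters keep their single-byte ASCII value in UTF-8.
ascii_upper <- charToRaw("A")     # hex 41, decimal 65
ascii_lower <- charToRaw("a")     # hex 61, decimal 97

# A character outside ASCII needs more than one byte in UTF-8.
# "\u0152" is the Unicode escape for the ligature OE.
oe_bytes <- charToRaw("\u0152")   # hex c5 92

print(ascii_upper)
print(ascii_lower)
print(oe_bytes)
```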
There are multiple difficulties here.
Firstly, "extended ASCII" is ambiguous: it covers a family of 8-bit tables that includes ISO-8859-1 (Latin-1) and Windows-1252 (also called CP1252 or ANSI, the Windows default "latin" encoding). However, the code for Œ varies within this family of tables. In CP1252, "Œ" is represented by 10001100, or "\x8c", as you wrote. However, it does not exist in ISO-8859-1. And in UTF-8 it corresponds to the bytes "\xc5\x92", which encode the code point "\u0152", as rlegendi indicated.
So, to write UTF-8 from a CP1252-as-binary-as-string value, you have to convert your string into a "raw" number (the R class for bytes) and then into a character, change its encoding from CP1252 to UTF-8 (in fact, convert its byte value to the corresponding one for the same character in UTF-8), then convert it back to raw, and finally write it to the file:
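char_bin_str <- '10001100'
# Stages, innermost call first:
#   '10001100' -> 140 (strtoi) -> 8c (as.raw) -> "\x8c" (rawToChar) -> UTF-8 string (iconv)
char_u <- iconv(rawToChar(as.raw(strtoi(char_bin_str, base=2))),
                from="CP1252",
                to="UTF-8")
test.file <- "~/test-unicode-bytes.txt"
zz <- file(test.file, 'wb')
# Write the UTF-8 bytes themselves, not the R string
writeBin(charToRaw(char_u), zz)
close(zz)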
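To see why the single CP1252 byte 8c becomes the two bytes c5 92, the two-byte UTF-8 layout can be worked out by hand from the code point. The following is a minimal sketch, not part of the approach above (iconv does this for you); the names cp, byte1, byte2 and manual are only illustrative, and the formula shown applies only to code points between 0x80 and 0x7FF.

```r
# Code point of the ligature OE: U+0152, decimal 338
cp <- utf8ToInt("\u0152")

# Two-byte UTF-8 layout for code points 0x80..0x7FF: 110xxxxx 10xxxxxx
byte1 <- bitwOr(0xC0L, bitwShiftR(cp, 6L))   # marker 110 + top 5 bits
byte2 <- bitwOr(0x80L, bitwAnd(cp, 0x3FL))   # marker 10  + low 6 bits

manual <- as.raw(c(byte1, byte2))
print(manual)
```

The result can be compared with charToRaw("\u0152") to confirm that the hand-built bytes match what R produces.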
Secondly, when you call readBin(), do not forget to give a number of bytes to read which is big enough (n=file.info(test.file)$size here), otherwise it reads only the first byte (see below):
zz <- file(test.file, 'rb')
x <- readBin(zz, 'raw', n=file.info(test.file)$size)
close(zz)
x
[1] c5 92
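The effect of n can be seen in a small self-contained sketch. It assumes a temporary file (tmp is a hypothetical name used only here) rather than the file from the question, and writes the two UTF-8 bytes directly.

```r
tmp <- tempfile()
con <- file(tmp, "wb")
writeBin(as.raw(c(0xc5, 0x92)), con)   # the two UTF-8 bytes of the ligature OE
close(con)

# Default n = 1: only the first byte comes back
con <- file(tmp, "rb")
first_only <- readBin(con, "raw")
close(con)

# n set to the file size: the whole file comes back
con <- file(tmp, "rb")
all_bytes <- readBin(con, "raw", n = file.info(tmp)$size)
close(con)

unlink(tmp)
print(first_only)
print(all_bytes)
```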
Thirdly, if in the end you want to turn it back into a character that R understands and displays correctly, you first have to convert it into a string with rawToChar(). How it is then displayed depends on your default encoding; call Sys.getlocale() to see what it is (probably something ending with 1252 on Windows). The best option is probably to declare that your character should be read as UTF-8, otherwise it will be interpreted with your default encoding.
xx <- rawToChar(x)
Encoding(xx) <- "UTF-8"
xx
[1] "Œ"
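One way to confirm that the marking worked, without relying on how the console displays the character, is to compare byte and character counts. This is a hedged sketch that starts from the two bytes directly; before_mark is an illustrative name.

```r
x <- as.raw(c(0xc5, 0x92))
xx <- rawToChar(x)
before_mark <- Encoding(xx)      # rawToChar() does not declare an encoding
Encoding(xx) <- "UTF-8"          # declare the bytes to be UTF-8

n_bytes <- nchar(xx, type = "bytes")   # storage size in bytes
n_chars <- nchar(xx, type = "chars")   # number of characters
code_point <- utf8ToInt(xx)            # Unicode code point as an integer

print(c(bytes = n_bytes, chars = n_chars, code_point = code_point))
```

Two bytes counted as one character, with code point 338 (0x152), indicates that R now treats the string as the single character Œ.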
This should keep things under control, write the correct bytes in UTF-8, and be the same on every OS. Hope it helps.
PS: I am not exactly sure why x returned c5 in your code; I guess it would have returned c5 92 if you had passed n=2 (or more) to readBin(). On my machines (Mac OS X 10.7 with R 3.0.2, and Win XP with R 2.15), your code returns 31, the hex ASCII representation of '1' (the first char in '10001100'), which makes sense. Maybe you opened your file in Office as CP1252 and saved it as UTF-8 there, before coming back to R?
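The 31 mentioned above can be reproduced without touching a file, because writeBin() returns the bytes it would write when it is given a raw vector as its connection. A minimal sketch (written is an illustrative name):

```r
# What writeBin("10001100", zz) actually puts in the file:
# the eight ASCII digits followed by a terminating nul byte,
# not the single byte 0x8c.
written <- writeBin("10001100", raw())
print(written)
print(length(written))
```

The first byte is 31 (the character '1'), which is what readBin() returns with its default n = 1.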
Try this instead (I replaced the binary value with the Unicode escape for the character, because I think it is better when you want such an output):
writeBin(charToRaw("\u0152"), zz)
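As a sanity check, the escape-based route and the CP1252 conversion route should give the same bytes. The sketch below assumes that the local iconv knows the name "CP1252" (it usually does, but encoding names are platform-dependent); the variable names are illustrative.

```r
# Route 1: Unicode escape, as in the line above
from_escape <- charToRaw("\u0152")

# Route 2: start from the CP1252 byte 0x8c and convert to UTF-8
cp1252_char <- rawToChar(as.raw(0x8c))
from_cp1252 <- charToRaw(iconv(cp1252_char, from = "CP1252", to = "UTF-8"))

print(from_escape)
print(from_cp1252)
print(identical(from_escape, from_cp1252))
```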