
How to use Unicode codepoints above U+FFFF in Rebol 3 strings like in Rebol 2?

I know you can't use caret-style escaping in strings for codepoints above ^(FF) in Rebol 2, because it knows nothing about Unicode. So this doesn't produce anything good; the output looks garbled:

print {Q: What does a Zen master's {Cow} Say?  A: "^(03BC)"!}

Yet the code works in Rebol 3 and prints out:

Q: What does a Zen master's {Cow} Say?  A: "μ"!

That's great, but R3 apparently can't hold any character above U+FFFF in a string:

>> type? "^(FFFF)"
== string!

>> type? "^(010000)"
** Syntax error: invalid "string" -- {"^^(010000)"}
** Near: (line 1) type? "^(010000)"

The situation is a lot better than the random behavior of Rebol 2 when it met codepoints it didn't know about. However, there used to be a workaround in Rebol for storing strings if you knew how to do your own UTF-8 encoding (or got your strings by way of loading source code off disk). You could just assemble them from individual characters.

So the UTF-8 encoding of U+010000 is #{F0908080}, and previously you could write:

workaround: rejoin [#"^(F0)" #"^(90)" #"^(80)" #"^(80)"]

And you'd get a string with that single codepoint encoded using UTF-8, that you could save to disk in code blocks and read back in again. Is there any similar trick in R3?
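For anyone who wants to double-check those four bytes, here is a minimal sketch in Python (used only as a neutral calculator, it is not part of the Rebol code in question). It packs U+10000 into the four-byte UTF-8 layout by hand. The function name `utf8_four_byte` is made up for illustration.

```python
def utf8_four_byte(code: int) -> bytes:
    """Encode a codepoint in U+10000..U+10FFFF with the 4-byte UTF-8 layout:
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    """
    if not 0x10000 <= code <= 0x10FFFF:
        raise ValueError("only supplementary-plane codepoints are handled here")
    return bytes([
        0xF0 | (code >> 18),           # lead byte carries the top 3 bits
        0x80 | ((code >> 12) & 0x3F),  # next 6 bits
        0x80 | ((code >> 6) & 0x3F),   # next 6 bits
        0x80 | (code & 0x3F),          # bottom 6 bits
    ])

encoded = utf8_four_byte(0x10000)
print(encoded.hex().upper())  # F0908080
```

The result agrees with what Python's built-in encoder produces for the same codepoint, so the byte values in the workaround are at least the right ones.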

HostileFork says dont trust SE asked Feb 25 '13 22:02


2 Answers

There is a workaround using the string! datatype as well. You cannot use UTF-8 in that case, but you can use a UTF-16 workaround as follows:

utf-16: "^(d800)^(dc00)"

This encodes the U+10000 code point as a UTF-16 surrogate pair. In general, the following function can do the encoding:

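utf-16: func [
    code [integer!]    ; Unicode codepoint, 0 to 1114111 (U+10FFFF)
    /local low high
] [
    case [
        code < 0 [do make error! "invalid code"]
        ; Basic Multilingual Plane: a single 16-bit unit is enough
        code < 65536 [append copy "" to char! code]
        ; Supplementary planes: split into a surrogate pair
        code < 1114112 [
            code: code - 65536           ; 20-bit offset above U+FFFF
            low: code and 1023           ; bottom 10 bits
            high: code - low / 1024      ; top 10 bits (evaluated left to right)
            ; 55296 = D800 (high surrogate base), 56320 = DC00 (low surrogate base)
            append append copy "" to char! high + 55296 to char! low + 56320
        ]
        'else [do make error! "invalid code"]
    ]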
]
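As a cross-check of the arithmetic in that function, here is the same split written as a small Python sketch (Python is just a convenient calculator here; the name `utf16_surrogates` is hypothetical). It returns the 16-bit code units as integers rather than building a string.

```python
def utf16_surrogates(code: int) -> list:
    """Return the UTF-16 code unit(s) for a codepoint as a list of integers."""
    if code < 0 or code > 0x10FFFF:
        raise ValueError("invalid code")
    if code < 0x10000:
        return [code]                 # fits in one 16-bit unit
    offset = code - 0x10000           # 20-bit value
    low = offset & 0x3FF              # bottom 10 bits
    high = offset >> 10               # top 10 bits
    return [0xD800 + high, 0xDC00 + low]

print([hex(u) for u in utf16_surrogates(0x10000)])  # ['0xd800', '0xdc00']
```

That matches the "^(d800)^(dc00)" literal given for U+10000.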
Ladislav answered Oct 25 '22 10:10


Yes, there is a trick...which is the trick you should have been using in R2 as well. Don't use a string! Use a binary! if you have to do this sort of thing:

good-workaround: #{F0908080}

It would've worked in Rebol2, and it works in Rebol3. You can save it and load it without any funny business.

In fact, if you care about Unicode at all, ever...stop doing string processing that uses codepoints higher than ^(7F) if you are stuck in Rebol 2 and not 3. We'll see why by looking at that terrible workaround:

terrible-workaround: rejoin [#"^(F0)" #"^(90)" #"^(80)" #"^(80)"]

..."And you'd get a string with that single UTF-8 codepoint"...

The only thing you should get is a string with four individual character codepoints, and with 4 = length? terrible-workaround. Rebol2 is broken because string! is basically no different from binary! under the hood. In fact, in Rebol2 you could alias the two types back and forth without making a copy; look up AS-BINARY and AS-STRING. (This is impossible in Rebol3 because they really are fundamentally different, so don't get attached to the feature!)

It's somewhat deceptive to see these strings reporting a length of 4, and there's a false comfort of each character producing the same value if you convert them to integer!. Because if you ever write them out to a file or port somewhere, and they need to be encoded, you'll get bitten. Note this in Rebol2:

>> to integer! #"^(80)"
== 128

>> to binary! #"^(80)"
== #{80}

But in R3, you have a UTF-8 encoding when binary conversion is needed:

>> to integer! #"^(80)"
== 128

>> to binary! #"^(80)"
== #{C280}

So you will be in for a surprise when your seemingly-working code does something different at a later time, and winds up serializing differently. In fact, if you want to know how "messed up" R2 is in this regard, look at why you got a weird symbol for your "mu". In R2:

>> to binary! #"^(03BC)"
== #{BC}

It just threw the "03" away. :-/
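To make the difference concrete outside of Rebol, here is a rough Python sketch of the two behaviors. It assumes Latin-1 is a fair stand-in for R2's one-byte-per-character conversion, and it models the "threw the 03 away" case as keeping only the low byte; both are my assumptions about how to mimic it, not a statement about R2 internals.

```python
ch = chr(0x80)

one_byte = ch.encode("latin-1")  # one byte per character, like the R2 result above
utf8 = ch.encode("utf-8")        # UTF-8 encoded, like the R3 result above
print(one_byte.hex(), utf8.hex())  # 80 c280

mu = 0x03BC
truncated = mu & 0xFF            # keep only the low byte, dropping the "03"
print(hex(truncated))            # 0xbc
```

Same character, same integer value, two different serializations. That is exactly the kind of thing that only shows up once the data leaves memory.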

So if for some reason you need to work with Unicode strings and can't switch to R3, try something like this for the cow example:

mu-utf8: #{CEBC} ; UTF-8 encoding of U+03BC
utf8: rejoin [#{} {Q: What does a Zen master's {Cow} Say?  A: "} mu-utf8 {"!}]

That gets you a binary. Only convert it to string for debug output, and be ready to see gibberish. But it is the right thing to do if you're stuck in Rebol2.

And to reiterate the answer: it's also what to do if for some odd reason you are stuck needing to use those higher codepoints in Rebol3:

utf8: rejoin [#{} {Q: What did the Mycenaean's {Cow} Say?  A: "} #{F0908080} {"!}]

I'm sure that would be a very funny joke if I knew what LINEAR B SYLLABLE B008 A was. Which leads me to say that most likely, if you're doing something this esoteric you probably only have a few codepoints being cited as examples. You can hold most of your data as string up until you need to slot them in conveniently, and hold the result in a binary series.
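When splicing literal bytes into a binary like this, it's worth confirming they really are the UTF-8 encoding and not just the codepoint written in hex. A quick way to check, sketched in Python (only as a verification aid; the variable names are made up):

```python
import unicodedata

mu_utf8 = "\u03bc".encode("utf-8")        # GREEK SMALL LETTER MU
b008_utf8 = "\U00010000".encode("utf-8")  # first codepoint above U+FFFF

print(mu_utf8.hex(), b008_utf8.hex())     # cebc f0908080
print(unicodedata.name("\U00010000"))     # LINEAR B SYLLABLE B008 A

# Assemble the cow joke as bytes, then decode only for display.
joke = b'Q: What does a Zen master\'s {Cow} Say?  A: "' + mu_utf8 + b'"!'
print(joke.decode("utf-8"))
```

Note that the UTF-8 bytes for mu are CE BC, which is not the same as the codepoint value 03BC.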


UPDATE: If one hits this problem, here is a utility function that can be useful for working around it temporarily:

safe-r2-char: charset [#"^(00)" - #"^(7F)"]
unsafe-r2-char: charset [#"^(80)" - #"^(FF)"]
hex-digit: charset [#"0" - #"9" #"A" - #"F" #"a" - #"f"]

r2-string-to-binary: func [
    str [string!] /string /unescape /unsafe
    /local result s e escape-rule unsafe-rule safe-rule rule
] [
    result: copy either string [{}] [#{}]
    escape-rule: [
        "^^(" s: 2 hex-digit e: ")" (
            append result debase/base copy/part s e 16
        )
    ]
    unsafe-rule: [
        s: unsafe-r2-char (
            append result to integer! first s
        )
    ]
    safe-rule: [
        s: safe-r2-char (append result first s)
    ]
    rule: compose/deep [
        any [
            (either unescape [[escape-rule |]] [])
            safe-rule
            (either unsafe [[| unsafe-rule]] [])
        ]
    ]
    unless parse/all str rule [
        print "Unsafe codepoints found in string! by r2-string-to-binary"
        print "See http://stackoverflow.com/questions/15077974/"
        print mold str
        throw "Bad codepoint found by r2-string-to-binary"
    ]
    result
]

If you use this instead of a to binary! conversion, you will get consistent behavior in both Rebol2 and Rebol3. (It effectively implements a solution for terrible-workaround style strings.)
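For readers who don't speak PARSE, here is a rough Python analog of what the function is meant to do, as I read it. This is a sketch under assumptions, not a port: the name `string_to_binary` and its keyword arguments are hypothetical, it always returns bytes (there is no equivalent of the /string refinement), and it assumes the escapes appear as the literal text `^(XX)` inside the string.

```python
import re

# Literal caret escape with exactly two hex digits, e.g. the text "^(F0)"
ESCAPE = re.compile(r"\^\(([0-9A-Fa-f]{2})\)")

def string_to_binary(text: str, unescape: bool = False, unsafe: bool = False) -> bytes:
    """Convert text to bytes, refusing codepoints that would serialize ambiguously."""
    out = bytearray()
    i = 0
    while i < len(text):
        m = ESCAPE.match(text, i) if unescape else None
        if m:
            out.append(int(m.group(1), 16))  # decode the escaped byte
            i = m.end()
            continue
        code = ord(text[i])
        # 00..7F is always accepted; 80..FF only when explicitly allowed
        if code <= 0x7F or (unsafe and code <= 0xFF):
            out.append(code)
            i += 1
            continue
        raise ValueError(f"bad codepoint U+{code:04X} at index {i}")
    return bytes(out)

print(string_to_binary("^(F0)^(90)^(80)^(80)", unescape=True).hex())  # f0908080
```

The point of the design is that the default is strict: anything above 7F fails loudly unless the caller opts in, so the surprise happens at conversion time instead of at serialization time.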

HostileFork says dont trust SE answered Oct 25 '22 10:10