Is there a function to decode a percent-encoded UTF-8 string, such as one submitted from a form?

I want to store some data using an HTML form and a Rebol CGI script. My form looks like this:

<form action="test.cgi" method="post" >

     Input:

     <input type="text" name="field"/>
     <input type="submit" value="Submit" />

</form>

But for Unicode characters such as Chinese ones, I get the percent-encoded form of the data, for instance %E4%BA%BA.

(This is for the Chinese character "人" ... its UTF-8 form as a Rebol binary literal is #{E4BABA})

Is there a function in the system, or an existing library that can decode this directly? dehex does not appear to currently cover this case. I'm currently decoding this manually by removing the percent signs and constructing the corresponding binary, like this:

data: to-string read system/ports/input
print data

;-- this prints "field=%E4%BA%BA"

k-v: parse data "="
print k-v

;-- this prints ["field" "%E4%BA%BA"]

v: append insert replace/all k-v/2 "%" "" "#{" "}"
print v

;-- This prints "#{E4BABA}" ... a string!, not binary!
;-- LOAD will help construct the corresponding binary
;-- then TO-STRING will decode that binary from UTF-8 to character codepoints

write %test.txt to-string load v
Wayne Cui asked Aug 20 '13 09:08


2 Answers

I have a library called AltWebForm that en/decodes percent-encoded web form data:

do http://reb4.me/r3/altwebform
load-webform "field=%E4%BA%BA"

The library is described here: Rebol and Web Forms.

rgchris answered Oct 25 '22 02:10


Looks to be related to ticket #1986, where it is discussed whether this is a "bug" or the Internet changing out from under its own spec:

Have DEHEX convert UTF-8 sequences from browsers as Unicode.

If you have specific experience on what has become standard in Chinese, and want to weigh in, that would be valuable.

Just as an aside, the specific case above could alternatively have been handled in PARSE as:

key-value: {field=%E4%BA%BA}

utf8-bytes: copy #{}

either parse key-value [
    copy field-name to {=}
    skip
    some [
        and {%}
        copy enhexed-byte 3 skip (
            append utf8-bytes dehex enhexed-byte
        )
    ]
] [
    print [field-name {is} to string! utf8-bytes]
] [
    print {Malformed input.}
]

That will output:

field is 人

With some comments included:

key-value: {field=%E4%BA%BA}

;-- Generate empty binary value by copying an empty binary literal     
utf8-bytes: copy #{}

either parse key-value [

    ;-- grab field-name as the chars right up to the equals sign
    copy field-name to {=}

    ;-- skip the equal sign as we went up to it, without moving "past" it
    skip

    ;-- apply the enclosed rule SOME (non-zero) number of times
    some [
        ;-- match a percent sign as the immediate next symbol, without
        ;-- advancing the parse position
        and {%}

        ;-- grab the next three chars, starting with %, into enhexed-byte
        copy enhexed-byte 3 skip (

            ;-- If we get to this point in the match rule, this parenthesized
            ;-- expression lets us evaluate non-dialected Rebol code to 
            ;-- append the dehexed byte to our utf8 binary
            append utf8-bytes dehex enhexed-byte
        )
    ]
] [
    print [field-name {is} to string! utf8-bytes]
] [
    print {Malformed input.}
]
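
The rule above only matches a value made up entirely of %XX triplets. As a rough sketch of how the same idea might be generalized to values that mix literal characters with percent-encoded bytes, something like the following could work. It assumes Rebol 3 (where to string! decodes a binary as UTF-8), the names decode-form-value and hex-digit are made up for illustration, and it uses DEBASE rather than DEHEX to turn each hex pair into a raw byte:

```rebol
;-- Hypothetical helper, not part of Rebol or any library.
;-- Assumes Rebol 3, where TO STRING! decodes a binary as UTF-8.

hex-digit: charset "0123456789ABCDEFabcdef"

decode-form-value: func [
    "Decode one percent-encoded form value into a string (sketch)."
    text [string!]
    /local bytes digits other
][
    ;-- Collect raw bytes first, so multi-byte UTF-8 sequences are
    ;-- reassembled before any conversion to characters happens
    bytes: copy #{}

    parse text [
        any [
            ;-- %XX becomes the single byte with that hex value
            "%" copy digits 2 hex-digit (
                append bytes debase/base digits 16
            )

            ;-- form encoding uses + for a space
            | "+" (append bytes #{20})

            ;-- anything else (including a stray %) is kept as-is
            | copy other skip (append bytes to binary! other)
        ]
    ]

    to string! bytes
]
```

It could then be called as decode-form-value "%E4%BA%BA", or on the value half of each key=value pair after splitting the posted data. Malformed UTF-8 in the input is not handled here.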

(Note also that "simple parse" is getting the axe in favor of enhancements to SPLIT. So writing code like parse data "=" can now be expressed instead as split data "=", or other cool variants if you check them out...samples are in the ticket.)

HostileFork says dont trust SE answered Oct 25 '22 02:10