I want to store some data with an HTML form and Rebol CGI. My form looks like this:
<form action="test.cgi" method="post" >
Input:
<input type="text" name="field"/>
<input type="submit" value="Submit" />
</form>
But for Unicode characters such as Chinese, I get the data in percent-encoded form, for instance %E4%BA%BA. (This is for the Chinese character "人"; its UTF-8 form as a Rebol binary literal is #{E4BABA}.)
Is there a function in the system, or an existing library, that can decode this directly? dehex does not appear to currently cover this case. I'm currently decoding this manually by removing the percent signs and constructing the corresponding binary, like this:
data: to-string read system/ports/input
print data
;-- this prints "field=%E4%BA%BA"
k-v: parse data "="
print k-v
;-- this prints ["field" "%E4%BA%BA"]
v: append insert replace/all k-v/2 "%" "" "#{" "}"
print v
;-- This prints "#{E4BABA}" ... a string!, not binary!
;-- LOAD will help construct the corresponding binary
;-- then TO-STRING will decode that binary from UTF-8 to character codepoints
write %test.txt to-string load v
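For comparison, here is a minimal sketch of the same manual step that skips building a "#{...}" string and calling LOAD: DEBASE with the /base refinement turns the hex digits into a binary! directly. Like the code above, it assumes every character of the value is percent-encoded, and it assumes TO-STRING decodes the binary as UTF-8 (Rebol 3 behaviour). The variable names are illustrative only.

```rebol
value: "%E4%BA%BA"                          ;-- hypothetical sample value
hex-digits: replace/all copy value "%" ""   ;-- strip the percent signs: "E4BABA"
utf8-bytes: debase/base hex-digits 16       ;-- #{E4BABA}, already a binary!
print to-string utf8-bytes                  ;-- decoded as UTF-8 (assumes Rebol 3)
```

A value that mixes plain ASCII with percent-encoded bytes (for example "abc%E4%BA%BA") would not survive this, because the unencoded characters are not hex digits.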
The Difference Between Unicode and UTF-8
Unicode is a character set; UTF-8 is an encoding. Unicode is a list of characters, each assigned a unique number (its code point).
In UTF-8, each character is represented by one to four bytes. UTF-8 is backward-compatible with ASCII and can represent any standard Unicode character. The first 128 code points encode to exactly the same single bytes as the 128 ASCII characters (numbered 0-127), meaning that existing ASCII text is already valid UTF-8.
UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a matching unique binary string, and can also translate the binary string back to a Unicode character. This is the meaning of “UTF”, or “Unicode Transformation Format.”
UTF-8 encodes Unicode characters into a sequence of 8-bit bytes. The standard has a capacity for over a million distinct codepoints and is a superset of all characters in widespread use today. By comparison, ASCII (American Standard Code for Information Interchange) includes 128 character codes.
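As a small illustration of the byte counts described above, assuming Rebol 3 (where converting a string! to binary! yields its UTF-8 bytes, and converting back decodes them):

```rebol
probe to binary! "A"            ;-- an ASCII character is one byte: #{41}
probe to binary! "人"           ;-- this character takes three bytes: #{E4BABA}
probe length? to binary! "人"   ;-- 3
probe to string! #{E4BABA}      ;-- decoding the bytes gives back "人"
```

Those three bytes are exactly what the browser sends as %E4%BA%BA.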
I have a library called AltWebForm that en/decodes percent-encoded web form data:
do http://reb4.me/r3/altwebform
load-webform "field=%E4%BA%BA"
The library is described here: Rebol and Web Forms.
Looks to be related to ticket #1986, where it is discussed whether this is a "bug" or the Internet changing out from under its own spec:
Have DEHEX convert UTF-8 sequences from browsers as Unicode.
If you have specific experience on what has become standard in Chinese, and want to weigh in, that would be valuable.
Just as an aside, the specific case above could have been handled in PARSE alternately as:
key-value: {field=%E4%BA%BA}
utf8-bytes: copy #{}
either parse key-value [
    copy field-name to {=}
    skip
    some [
        and {%}
        copy enhexed-byte 3 skip (
            append utf8-bytes dehex enhexed-byte
        )
    ]
] [
    print [field-name {is} to string! utf8-bytes]
] [
    print {Malformed input.}
]
That will output:
field is 人
With some comments included:
key-value: {field=%E4%BA%BA}
;-- Generate empty binary value by copying an empty binary literal
utf8-bytes: copy #{}
either parse key-value [
    ;-- grab field-name as the chars right up to the equals sign
    copy field-name to {=}
    ;-- skip the equal sign as we went up to it, without moving "past" it
    skip
    ;-- apply the enclosed rule SOME (non-zero) number of times
    some [
        ;-- match a percent sign as the immediate next symbol, without
        ;-- advancing the parse position
        and {%}
        ;-- grab the next three chars, starting with %, into enhexed-byte
        copy enhexed-byte 3 skip (
            ;-- If we get to this point in the match rule, this parenthesized
            ;-- expression lets us evaluate non-dialected Rebol code to
            ;-- append the dehexed byte to our utf8 binary
            append utf8-bytes dehex enhexed-byte
        )
    ]
] [
    print [field-name {is} to string! utf8-bytes]
] [
    print {Malformed input.}
]
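The rule above assumes the whole value is percent-encoded. What follows is a hedged sketch of a more general variant, written with Rebol 3 in mind and not checked against other versions. It copies unencoded characters through as their UTF-8 bytes, treats + as a space (the usual form-encoding convention), and uses DEBASE rather than DEHEX for the hex pairs. The function name decode-form-value and its local names are made up for this example.

```rebol
decode-form-value: func [
    {Decode a percent-encoded form value, treating the bytes as UTF-8.}
    text [string!]
    /local out hex-digit pair other
][
    out: copy #{}
    hex-digit: charset "0123456789ABCDEFabcdef"
    parse text [
        any [
            ;-- %XX becomes the single byte with that hex value
            "%" copy pair 2 hex-digit (append out debase/base pair 16)
            ;-- form encoding uses + for a space
            | "+" (append out #{20})
            ;-- anything else is copied through as its UTF-8 bytes
            | copy other skip (append out to binary! other)
        ]
    ]
    ;-- decode the collected bytes as UTF-8 (assumes Rebol 3)
    to string! out
]

print decode-form-value "abc+%E4%BA%BA"
```

A stray % that is not followed by two hex digits falls through to the last alternative and is kept as a literal character, which may or may not be the behaviour you want for malformed input.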
(Note also that "simple parse" is getting the axe in favor of enhancements to SPLIT. So writing code like parse data "=" can now be expressed instead as split data "=", or other cool variants if you check them out...samples are in the ticket.)