
How are characters transmitted over a form?

<head>
<meta charset="ISO-8859-7">
</head>

I've been working with forms and see that the <meta charset="ISO-8859-7"> tag determines the encoding of the text typed into a text area, which is something the encoding used to store the file does not do.

I've seen that if a typed character isn't part of the encoding specified by the <meta charset="ISO-8859-7"> tag, the character is sent as a numeric character reference (&#D;).

I assumed that the form sent byte sequences in the specified encoding, because whatever character I type should end up as a byte that an encoding will interpret.

For example, with <meta charset="ISO-8859-7">, I type the character "¥" into the form.

This character isn't part of the encoding, but I expected it to be sent as the byte for the position it occupies, A5, regardless of whether it can be represented (this is what any editor normally does).

But no: the form doesn't send it as a byte. Instead, the character is sent as a reference.

Code:

index.php:

<?php header('Content-Type: text/html; charset=ISO-8859-7'); ?>

<head>
    <meta charset="ISO-8859-7">
</head>
<form method="post" action="encode.php" accept-charset="ISO-8859-7">
    <p><textarea name="input" maxlength="10" rows="5" cols="100"></textarea></p>
    <p><button>Submit</button></p>
</form>

encode.php:

<head>
    <meta charset="ISO-8859-7"><!-- Useless: even if ISO-8859-1 is specified, where "¥" exists, the form sends a character reference rather than a byte to interpret. -->
</head>
<?php
    $input=$_POST["input"];
    var_dump($input);
?>

Result in the page source:

string(6) "&#165;"

Note: I've tested changing the encoding used to store the files.

In index.php: it doesn't matter what encoding is used to store the file. The form always sends according to the accept-charset="" attribute, or according to the <meta charset=""> tag if accept-charset="" is not specified.

In encode.php: the string is never re-encoded by the file. It can be processed and displayed, but the encoding used to store the file has nothing to do with that.

nEAnnam asked Oct 10 '22 10:10


1 Answer

The problem is that the typed character is not supported by the form encoding.

As far as I can see, neither HTML 4 nor HTML 5 specifies what the browser should do if the user enters a character in a form field that is not supported by the form encoding.

HTML 5 does specify that unsupported characters should be replaced by an ASCII ? in the query part of URLs¹ (and thus in GET form submissions?), but I can't find anything for POST forms.

It seems that all the browsers (or at least IE, FF, Chrome, Opera) have agreed on encoding unsupported characters as a decimal numeric character reference, such as &#165; (the form usually called an XML or HTML entity). A better approach would probably have been to warn the user and prevent form submission, but that's water under the bridge.
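The behaviour seen in the question can be reproduced outside a browser. The following is a minimal Python sketch, not the browsers' actual code: it assumes Python's `xmlcharrefreplace` error handler is a fair stand-in for what the browsers do, and `encode_form_value` is a hypothetical helper name.

```python
def encode_form_value(text: str, charset: str = "iso-8859-7") -> bytes:
    """Encode text in the form charset.

    Characters the charset cannot represent are replaced by a decimal
    numeric character reference (&#N;), mimicking the observed browser output.
    """
    return text.encode(charset, errors="xmlcharrefreplace")


print(encode_form_value("α"))  # Greek alpha exists in ISO-8859-7: b'\xe1'
print(encode_form_value("¥"))  # yen sign does not: b'&#165;'
```

The second result is six ASCII bytes, which matches the `string(6) "&#165;"` that `var_dump` reports in the question.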

The solution is, of course, to use UTF-8 all the way. Then all characters are supported by the encoding, and this problem doesn't arise.
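As a rough illustration of why UTF-8 avoids the problem, the Python sketch below checks whether a string can be encoded without loss in a given charset. The helper name `survives_encoding` and the sample string are assumptions made for this example.

```python
def survives_encoding(text: str, charset: str) -> bool:
    """Return True if every character of text can be encoded in charset."""
    try:
        text.encode(charset)  # strict mode: raises on unsupported characters
    except UnicodeEncodeError:
        return False
    return True


sample = "α ¥"
print(survives_encoding(sample, "iso-8859-7"))  # False: "¥" is not in ISO-8859-7
print(survives_encoding(sample, "utf-8"))       # True: UTF-8 covers all of Unicode
print("¥".encode("utf-8"))                      # b'\xc2\xa5'
```

With UTF-8 as the form encoding, "¥" travels as the two bytes C2 A5, so no character reference is needed.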


¹ 2.6.3 Resolving URLs. HTML 5, W3C Working Draft 25 May 2011, item 8.1:

If the character in question cannot be expressed in the encoding encoding, then replace it with a single 0x3F octet (an ASCII question mark) [...]

Fun fact: The above only applies to the query part (the part after the question mark) of the IRI. The path portion is always encoded using UTF-8. And the host name is of course encoded using Punycode. The mind boggles.
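The three rules can be sketched side by side in Python. This is an approximation under stated assumptions, not a URL parser: `encode_url_parts` is a hypothetical helper, Python's built-in `idna` codec stands in for the browser's Punycode handling, and the `replace` error handler stands in for the question-mark substitution described in the draft.

```python
from urllib.parse import quote


def encode_url_parts(host: str, path: str, query: str,
                     form_charset: str = "iso-8859-7") -> tuple[str, str, str]:
    """Encode each URL part the way the answer describes."""
    # Host name: Punycode (via Python's IDNA codec).
    host_ascii = host.encode("idna").decode("ascii")
    # Path: always percent-encoded UTF-8 (quote() defaults to UTF-8).
    path_encoded = quote(path)
    # Query: document encoding; unsupported characters become a literal "?".
    query_encoded = quote(query, safe="=&?", encoding=form_charset, errors="replace")
    return host_ascii, path_encoded, query_encoded


print(encode_url_parts("bücher.example", "/¥", "q=¥"))
# ('xn--bcher-kva.example', '/%C2%A5', 'q=?')
```

The same "¥" ends up as `%C2%A5` in the path but as `?` in the query, which is the inconsistency the fun fact points out.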

Søren Løvborg answered Oct 20 '22 05:10
