I'm building a web service and have a node that accepts a POST to create a new resource. The resource expects one of two content-types - an XML format I'll be defining, or form-encoded variables.
The idea is that consuming applications can POST XML directly and benefit from better validation etc., but there's also an HTML interface that will POST the form-encoded stuff. Obviously the XML format has a charset declaration, but I can't see how I detect the form's charset just from looking at the POST.
A typical post to the form from Firefox looks like this:
POST /path HTTP/1.1
Host: www.myhostname.com
User-Agent: Mozilla/5.0 [...etc...]
Accept: text/html,application/xhtml+xml, [...etc...]
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 41

field1=value1&field2=value2&field3=value3
Which doesn't seem to contain any useful indication of the character set.
From what I can see, the application/x-www-form-urlencoded type is entirely defined in HTML, which just lays out the %-encoding rules, but doesn't say anything about what charset the data should be in.
Basically, is there any way of telling the character set of the POST data if I don't know which character set the originating HTML page was served in? Otherwise I'll have to try to guess the character set based on which characters are present, and that's always a bit iffy from what I can tell.
One way to check this is to use the W3C Markup Validation Service. The validator usually detects the character encoding from the HTTP headers and information in the document. If the validator fails to detect the encoding, it can be selected on the validator result page via the 'Encoding' pulldown menu.
The charset parameter: it is very important to always label Web documents explicitly. HTTP/1.1 (RFC 2616) says that the default charset for text media types is ISO-8859-1.
The HTTP request and response body are encoded using the text encoding specified in the charset attribute of the Content-Type header.
In PHP, use the header() function before generating any content, e.g.: header('Content-type: text/html; charset=utf-8');
When no charset is given, the default encoding of an HTTP POST is ISO-8859-1.
Otherwise you have to look at the Content-Type header, which will then look like:
Content-Type: application/x-www-form-urlencoded ; charset=UTF-8
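As a rough illustration of the rule above, here is a minimal Python sketch that reads the charset parameter from a Content-Type header and falls back to ISO-8859-1 when it is missing. The function names charset_from_content_type and decode_form are made up for this example, and it assumes the body is plain ASCII with all other bytes %-encoded, as browsers normally send it.

```python
from urllib.parse import parse_qs


def charset_from_content_type(content_type, default="ISO-8859-1"):
    """Return the charset parameter of a Content-Type value, or the default."""
    # Everything after the first ';' is a list of name=value parameters.
    for part in content_type.split(";")[1:]:
        name, _, value = part.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            return value.strip().strip('"')
    return default


def decode_form(body, content_type):
    """Decode a form-encoded body (bytes) using the charset the header declares."""
    charset = charset_from_content_type(content_type)
    # Assumption: the raw body is ASCII; non-ASCII data arrives as %XX escapes.
    return parse_qs(body.decode("ascii"), encoding=charset, errors="replace")


# Header with an explicit charset, including the stray space before ';'
print(decode_form(b"name=caf%C3%A9",
                  "application/x-www-form-urlencoded ; charset=UTF-8"))
# No charset parameter: the ISO-8859-1 fallback is used
print(decode_form(b"name=caf%E9", "application/x-www-form-urlencoded"))
```

Both calls should yield the same decoded value for name. In practice many browsers never send the charset parameter, so the fallback branch is the one that usually runs.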
You may be able to declare your form with
<form enctype="application/x-www-form-urlencoded;charset=UTF-8">
or
<form accept-charset="UTF-8">
to force the encoding.
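If the form is declared as UTF-8 this way, the server can at least check that what arrives really is UTF-8 instead of guessing. The sketch below is one possible approach, not a standard API: parse_utf8_form is a hypothetical helper that %-decodes each field to bytes and then decodes strictly, so a body sent in another charset raises UnicodeDecodeError and can be rejected.

```python
from urllib.parse import unquote_to_bytes


def parse_utf8_form(body):
    """Parse a form-encoded body (bytes), requiring every field to be valid UTF-8."""
    fields = {}
    for pair in body.split(b"&"):
        if not pair:
            continue
        name, _, value = pair.partition(b"=")
        # '+' means space in form encoding; %XX escapes become raw bytes.
        key = unquote_to_bytes(name.replace(b"+", b" ")).decode("utf-8")
        val = unquote_to_bytes(value.replace(b"+", b" ")).decode("utf-8")
        fields.setdefault(key, []).append(val)
    return fields


print(parse_utf8_form(b"field1=caf%C3%A9&field2=a+b"))

try:
    # %E9 alone is ISO-8859-1 for 'e acute', not valid UTF-8
    parse_utf8_form(b"field1=caf%E9")
except UnicodeDecodeError:
    print("rejected: body is not valid UTF-8")
```

A strict decode only proves the bytes are well-formed UTF-8; it cannot prove the client intended UTF-8, although text in legacy single-byte charsets rarely passes this check by accident.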
Some references:
http://www.htmlhelp.com/reference/html40/forms/form.html
http://www.w3schools.com/tags/tag_form.asp