
Detecting the character encoding of an HTTP POST request

I'm building a web service and have a node that accepts a POST to create a new resource. The resource expects one of two content-types - an XML format I'll be defining, or form-encoded variables.

The idea is that consuming applications can POST XML directly and benefit from better validation etc., but there's also an HTML interface that will POST the form-encoded stuff. Obviously the XML format has a charset declaration, but I can't see how to detect the form's charset just from looking at the POST.

A typical post to the form from Firefox looks like this:

POST /path HTTP/1.1
Host: www.myhostname.com
User-Agent: Mozilla/5.0 [...etc...]
Accept: text/html,application/xhtml+xml, [...etc...]
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 41

field1=value1&field2=value2&field3=value3

Which doesn't seem to contain any useful indication of the character set.

From what I can see, the application/x-www-form-urlencoded type is entirely defined in HTML, which just lays out the %-encoding rules, but doesn't say anything about what charset the data should be in.
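To illustrate why this matters, here is a small Python sketch (the sample word "café" is just an illustration, not taken from a real request) showing that the same text produces different %-escapes depending on the charset used before encoding:

```python
from urllib.parse import quote, unquote

# The same text percent-encodes differently depending on the charset
# that was applied before the %-escaping step.
utf8_form = quote("café", encoding="utf-8")          # 'caf%C3%A9'
latin1_form = quote("café", encoding="iso-8859-1")   # 'caf%E9'

# Decoding with the wrong charset silently corrupts the value:
# unquote() replaces undecodable bytes with U+FFFD by default.
right = unquote(latin1_form, encoding="iso-8859-1")
wrong = unquote(latin1_form, encoding="utf-8")

print(utf8_form, latin1_form, right, wrong)
```

So the receiver cannot tell from the escapes alone which charset the sender had in mind.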

Basically, is there any way of telling the character set of the POST if I don't know which character set the HTML form was originally served in? Otherwise I'll have to try and guess the character set based on what chars are present, and that's always a bit iffy from what I can tell.
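For reference, the kind of guessing I have in mind would be something like the sketch below. It assumes only two candidates (UTF-8 and ISO-8859-1), and the function name `guess_form_charset` is made up for illustration. It relies on the fact that most non-ASCII ISO-8859-1 byte sequences are not valid UTF-8, which is a heuristic and not a guarantee:

```python
from urllib.parse import unquote_to_bytes


def guess_form_charset(raw_value: str) -> str:
    """Guess the charset of one percent-encoded form value.

    Heuristic only: bytes that decode cleanly as UTF-8 are assumed to be
    UTF-8; anything else is assumed to be ISO-8859-1 (which can decode
    any byte sequence, so it never fails).
    """
    data = unquote_to_bytes(raw_value)  # undo the %-escapes, keep raw bytes
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "iso-8859-1"
    return "utf-8"


print(guess_form_charset("caf%C3%A9"))
print(guess_form_charset("caf%E9"))
```

Pure-ASCII values decode identically under both charsets, so the guess only matters when non-ASCII bytes are present.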

Ciaran McNulty asked Apr 02 '09 09:04





1 Answer

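HTTP/1.1 (RFC 2616) defines ISO-8859-1 as the default charset for text media types when none is declared, so that is the usual fallback to assume for a POST. In practice, browsers encode form data using the encoding of the page that contained the form, and usually do not state it in the request.

If the client does declare a charset, it appears in the Content-Type header, which will then look like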

Content-Type: application/x-www-form-urlencoded; charset=UTF-8
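On the server side, reading that parameter could look like the following Python sketch. The helper names `form_charset` and `decode_form` are hypothetical, and the ISO-8859-1 fallback is an assumption taken from the default mentioned above, not something the request itself guarantees:

```python
from email.message import Message
from urllib.parse import parse_qs

# Assumption: fall back to ISO-8859-1 when the client declares nothing.
DEFAULT_CHARSET = "iso-8859-1"


def form_charset(content_type: str) -> str:
    """Return the charset parameter of a Content-Type value, lowercased."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(DEFAULT_CHARSET)


def decode_form(content_type: str, body: bytes) -> dict:
    """Decode an application/x-www-form-urlencoded body into a dict of lists."""
    charset = form_charset(content_type)
    # A correctly percent-encoded body is pure ASCII; this raises
    # UnicodeDecodeError if the client sent raw non-ASCII bytes.
    text = body.decode("ascii")
    return parse_qs(text, encoding=charset, errors="strict",
                    keep_blank_values=True)


fields = decode_form(
    "application/x-www-form-urlencoded; charset=UTF-8",
    b"field1=caf%C3%A9&field2=value2",
)
print(fields)
```

Using `errors="strict"` makes a mismatched charset fail loudly instead of producing replacement characters; whether that is the right trade-off depends on the service.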

On the HTML side, you may be able to declare your form with

<form enctype="application/x-www-form-urlencoded;charset=UTF-8"> 

or

<form accept-charset="UTF-8"> 

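to force the encoding. Of the two, accept-charset is the standard attribute for this; a charset parameter inside enctype is not guaranteed to be honoured by browsers.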

Some references:

http://www.htmlhelp.com/reference/html40/forms/form.html

http://www.w3schools.com/tags/tag_form.asp

chburd answered Oct 06 '22 00:10
