
Can I force web browsers to send form text as UTF8?

I want to standardise on UTF8 on our web site. All our databases and internet stuff are in UTF8. All our web servers are sending the charset=utf-8 HTTP header. However, I've discovered that by changing the encoding in my Firefox (View -> Character Encoding) to something else, I can enter Latin-9 characters into a form and PHP just treats them as malformed UTF8.

How much do I have to worry about that? Is it possible for the user's web browser to override the UTF8 charset header and send non-UTF8?

Update: Several people have suggested accept-charset on the individual forms. However I'd rather not have to change every web form. Assuming I can control the HTTP content-type header, and it's set to UTF8, do I have anything to worry about?

Amandasaurus asked Jun 29 '09 10:06



2 Answers

Is it possible for the user's web browser to override the utf8 charset header and send non-UTF8?

Of course. You don't control the client, and the client can do whatever it wants, including letting users override the normal encodings and cause junk (or what passes for junk) to be sent to your server.

That said, it sounds like you've taken most of the important steps here. Your actual HTML document is UTF-8 encoded and explicitly marked as such, which means that browsers will generally default to submitting forms in that encoding as well. (Note that the HTML spec doesn't require this. Specifying accept-charset on the form explicitly is the only spec-compliant guarantee.) I suspect that this will work as expected in all modern browsers, and you could test this easily.

On the server, your job is always to validate your input to the extent that it's important to your service. Although the vast majority of your users will be benevolent and using modern standard browsers, the HTTP protocol is open, and both wacky users and malicious hackers are out there, and both can throw any kind of data they want at you. Make sure that you're not making assumptions about data encodings when security or authenticated data is involved, and sanitize this stuff before you shove it into databases.
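
As a rough illustration of that validation step, here is a minimal sketch in Python rather than PHP. The function name is_valid_utf8 is made up for this example; the idea is simply to attempt a strict UTF-8 decode of the raw submitted bytes and reject anything that fails. In PHP, mb_check_encoding($value, 'UTF-8') performs a comparable check.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw is well-formed UTF-8 (hypothetical helper)."""
    try:
        # bytes.decode is strict by default, so malformed input raises.
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same character sent by a browser using UTF-8 vs. Latin-9 (ISO-8859-15).
utf8_bytes = "caf\u00e9".encode("utf-8")
latin9_bytes = "caf\u00e9".encode("iso-8859-15")

print(is_valid_utf8(utf8_bytes))
print(is_valid_utf8(latin9_bytes))
```

The check should run on the raw request bytes, before anything is written to the database; what to do with a rejected value (refuse the request, ask the user to resubmit) depends on the site.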

Ben Zotto answered Oct 09 '22 05:10



I think the best solution is to convert to UTF-8 and handle any non-UTF-8 characters when the user submits data. As noted above, accept-charset="UTF-8" does not guarantee that the data is UTF-8. And if you have to change the forms all over your site, it is not a good solution anyway.

So, processing the input upon submission might be a better way.
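
One way this could look, sketched in Python rather than PHP: try a strict UTF-8 decode first and fall back to a single assumed legacy encoding only when that fails. The function name to_utf8_text and the choice of Latin-9 (ISO-8859-15) as the fallback are assumptions for illustration, taken from the encoding mentioned in the question; the server cannot know for certain which encoding a misbehaving client actually used.

```python
def to_utf8_text(raw: bytes, fallback: str = "iso-8859-15") -> str:
    """Decode submitted bytes as UTF-8, else as an assumed fallback encoding.

    Hypothetical helper: the fallback encoding is a guess, not something
    the browser reliably tells the server.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not well-formed UTF-8: reinterpret the bytes as the fallback.
        # errors="replace" keeps this from raising if another fallback
        # codec with unmapped bytes is passed in.
        return raw.decode(fallback, errors="replace")


# A euro sign is the single byte 0xA4 in Latin-9, which is invalid UTF-8.
text = to_utf8_text(b"\xa4")
stored = text.encode("utf-8")  # re-encode as UTF-8 before storing
```

A short legacy-encoded byte sequence can occasionally happen to be valid UTF-8 as well, so this conversion is a best-effort guess rather than a guarantee.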

B Seven answered Oct 09 '22 04:10
