
Is 0xF8 a valid byte in a UTF-8 encoded XML document?

I am receiving a document that claims to be UTF-8 (<?xml version="1.0" encoding="UTF-8"?>). In the past, the sender's encoding declaration has not always been reliable (documents were declared to have a given encoding when in fact they did not), so I check them using http://utf8checker.codeplex.com/. According to this tool, a 0xF8 byte means that the document is not UTF-8 encoded.

However, this page lists the Norwegian character 'ø' as being represented in UTF-8 as 0xF8. (The page is in Norwegian, but the data I am referring to comes from the table at the bottom of the page.)

Can anyone help me sort this out? I'm feeling rather confused here.

Thanks!

Eyvind asked Jan 26 '11 18:01


2 Answers

ø is U+00F8, and since it is not in ASCII, it cannot be encoded as a single UTF-8 code unit. In UTF-8 it is represented by the two bytes 0xC3 0xB8. Therefore, if 0xF8 stands alone somewhere in the document, then yes, the document is invalid UTF-8.

It seems that the document uses either Latin-1 or the Windows code page 1252, both of which encode ø as the single byte 0xF8.
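As a quick way to confirm this yourself, here is a minimal Python sketch. The helper name `is_valid_utf8` is made up for illustration, and it only relies on Python's built-in strict UTF-8 decoder; it is not what the checker tool mentioned in the question does internally.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # error handling is "strict" by default
    except UnicodeDecodeError:
        return False
    return True


encoded = "ø".encode("utf-8")
print(encoded)                      # b'\xc3\xb8'
print(is_valid_utf8(encoded))       # True
print(is_valid_utf8(b"\xf8"))       # False: a lone 0xF8 is not valid UTF-8

# The same single byte is a perfectly good 'ø' in the legacy encodings:
print(b"\xf8".decode("latin-1"))    # ø
print(b"\xf8".decode("cp1252"))     # ø
```

If a document fails this check but decodes cleanly as Latin-1 or cp1252, the encoding declaration is probably wrong rather than the data being corrupt.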

Joey answered Sep 26 '22 20:09


I don't think that page is very reliable; it also says "UTF-8 = UCS-1".

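Checking Wikipedia, F8 could only ever be used as the first byte of a 5-byte UTF-8 sequence in the original UTF-8 design. No Unicode character requires a 5-byte encoding, and the current definition of UTF-8 (RFC 3629) stops at 4 bytes, so F8 never appears in valid UTF-8. So no.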
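The lead-byte ranges can be tabulated in a short Python sketch. `lead_byte_length` is a hypothetical helper written for this answer, and the ranges assume the four-byte limit of RFC 3629; it classifies a single byte only and does not validate the continuation bytes that follow.

```python
from typing import Optional


def lead_byte_length(byte: int) -> Optional[int]:
    """Sequence length announced by a UTF-8 lead byte, or None if the
    byte can never start a sequence (hypothetical helper, RFC 3629 ranges)."""
    if 0x00 <= byte <= 0x7F:
        return 1          # ASCII
    if 0xC2 <= byte <= 0xDF:
        return 2          # 0xC0 and 0xC1 would be overlong, so they are excluded
    if 0xE0 <= byte <= 0xEF:
        return 3
    if 0xF0 <= byte <= 0xF4:
        return 4          # 0xF5..0xF7 would exceed U+10FFFF
    return None           # continuation bytes 0x80..0xBF and 0xF5..0xFF


print(lead_byte_length(0xC3))  # 2  (first byte of 'ø' in UTF-8)
print(lead_byte_length(0xF8))  # None
```

0xF8 is 11111000 in binary, which is the 111110xx pattern that announced a 5-byte sequence in the older definition.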

beetstra answered Sep 26 '22 20:09