Is 0xF8 a valid byte in a UTF-8 encoded XML document?

Question

I am receiving a document that claims to be UTF-8 (<?xml version="1.0" encoding="UTF-8"?>). I've had some problems in the past where the sender's encoding declaration has not been all that reliable (i.e. documents are declared to have a given encoding when in fact they do not), so I check them using http://utf8checker.codeplex.com/. According to this tool, a 0xF8 byte means that this document is not UTF-8 encoded.

However, this page lists the Norwegian character 'ø' as being represented in UTF-8 as 0xF8. (The page is in Norwegian, but the data I am referring to comes from the table at the bottom of the page.)

Can anyone help me sort this out? I'm feeling rather confused here.

Thanks!

Joey · Accepted Answer

ø is U+00F8, and since it is outside ASCII it cannot be encoded as a single UTF-8 code unit. In UTF-8 it is represented by the two bytes 0xC3 0xB8. Therefore, if you have 0xF8 standing alone somewhere in the document, then yes, it is invalid UTF-8.

It seems that the document uses either Latin-1 or Windows code page 1252; both encode ø as the single byte 0xF8.
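The byte values above can be checked with a few lines of code. This is a minimal sketch in Python, chosen only for brevity (the question itself is tagged c#); it relies on the standard codecs, and the helper name `is_valid_utf8` is made up for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


o_slash = "\u00f8"  # the character ø

# ø takes two bytes in UTF-8 ...
print(o_slash.encode("utf-8"))    # b'\xc3\xb8'
# ... but the single byte 0xF8 in Latin-1 and Windows-1252.
print(o_slash.encode("latin-1"))  # b'\xf8'
print(o_slash.encode("cp1252"))   # b'\xf8'

print(is_valid_utf8(b"\xc3\xb8"))  # True
print(is_valid_utf8(b"\xf8"))      # False
```

A lone 0xF8 therefore fails strict UTF-8 decoding, while the same byte decodes to ø under either of the two single-byte encodings.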

beetstra · Answer

I don't think that page is very reliable; it also says "UTF-8 = UCS-1".

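Checking Wikipedia, F8 could only ever be the first byte of a 5-byte sequence in the original UTF-8 design. RFC 3629 restricts UTF-8 to at most 4 bytes per character (code points up to U+10FFFF), so no Unicode character requires a 5-byte encoding and 0xF8 never appears in valid UTF-8. So no.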
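To make the byte ranges concrete, here is a rough Python sketch that classifies a single byte according to RFC 3629 (which allows sequences of at most four bytes) and reports where a byte string first stops being valid UTF-8. The function names `classify_byte` and `first_invalid_offset` are illustrative, not part of any library.

```python
from typing import Optional


def classify_byte(b: int) -> str:
    """Describe the role a byte value can play in UTF-8 under RFC 3629."""
    if b <= 0x7F:
        return "ASCII (single-byte sequence)"
    if b <= 0xBF:
        return "continuation byte"
    if b <= 0xC1:
        return "invalid (would start an overlong 2-byte sequence)"
    if b <= 0xDF:
        return "lead byte of a 2-byte sequence"
    if b <= 0xEF:
        return "lead byte of a 3-byte sequence"
    if b <= 0xF4:
        return "lead byte of a 4-byte sequence"
    # 0xF5..0xFF, which includes 0xF8, cannot occur in valid UTF-8.
    return "invalid (not allowed by RFC 3629)"


def first_invalid_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first invalid byte, or None if data is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start  # index of the first byte the decoder rejected
    return None


print(classify_byte(0xC3))  # lead byte of a 2-byte sequence
print(classify_byte(0xB8))  # continuation byte
print(classify_byte(0xF8))  # invalid (not allowed by RFC 3629)

print(first_invalid_offset(b"bl\xf8t"))  # 2
```

Reporting the offset rather than a plain yes/no makes it easier to inspect the offending bytes and guess which encoding the sender actually used.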

Tags:

c#

xml

encoding

utf-8

Eyvind
