
SQL Server - defining an XML type column with UTF-8 encoding

The default encoding for an XML type field defined in SQL Server is UTF-16. I have no trouble inserting UTF-16 encoded XML streams into that field.

But when I try to insert a UTF-8 encoded XML stream into the field, the insert fails with the error
unable to switch encoding.

QUESTION: Is there a way to define a SQL Server column/field as having UTF-8 encoding?

Further info

The insertion operations are performed using Spring JDBCTemplate.

The XML Stream was produced by JAXB Marshaller set to UTF-8 or UTF-16 encoding.

private String marshall(myDAO myTao, JAXBEncoding jaxbEncoding)
throws JAXBException{
    JAXBContext jc = JAXBContext.newInstance(ObjectFactory.class);
    Marshaller m = jc.createMarshaller();
    m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
    // Sets the encoding named in the generated XML declaration
    if (jaxbEncoding!=null)
        m.setProperty(Marshaller.JAXB_ENCODING, jaxbEncoding.toString());
    StringWriter strw = new StringWriter();
    m.marshal(myTao, strw);
    // Return the marshalled document as a Java String
    return strw.toString();
}

Where ...

public enum JAXBEncoding {
    UTF8("UTF-8"),
    UTF16("UTF-16")
    ;
    
    private String value;
    private JAXBEncoding(String value){
        this.value = value;
    }
    
    public String toString(){
        return this.value;
    }
}
Blessed Geek asked Jan 05 '17 21:01


2 Answers

Is there a way to define a SQL Server column/field as having UTF-8 encoding?

No, the only Unicode encoding in SQL Server is UTF-16 Little Endian, which is how the NCHAR, NVARCHAR, NTEXT, and XML datatypes are handled (NTEXT is deprecated as of SQL Server 2005, so don't use it in new development; besides, it sucks compared to NVARCHAR(MAX) anyway). You do not get a choice of Unicode encodings like some other RDBMSs allow.

You can insert UTF-8 encoded XML into SQL Server, provided you follow these three rules:

  1. The incoming string has to be of datatype VARCHAR, not NVARCHAR (as NVARCHAR is always UTF-16 Little Endian, hence the error about not being able to switch the encoding).
  2. The XML has an XML declaration that explicitly states that the encoding of the XML is indeed UTF-8: <?xml version="1.0" encoding="UTF-8" ?>.
  3. The byte sequence needs to be the actual UTF-8 bytes.

For example, we can import a UTF-8 encoded XML document containing the screaming face emoji (U+1F631, a Supplementary Character whose UTF-8 byte sequence is F0 9F 98 B1):

SET NOCOUNT ON;
DECLARE @XML XML = '<?xml version="1.0" encoding="utf-8"?><root><test>'
                    + CHAR(0xF0) + CHAR(0x9F) + CHAR(0x98) + CHAR(0xB1)
                    + '</test></root>';

SELECT @XML;
PRINT CONVERT(NVARCHAR(MAX), @XML);

Returns (in both "Results" and "Messages" tabs):

<root><test>😱</test></root>
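On the Java side, the same four bytes can be checked before anything is sent to the database. This is only a small sketch; the class and method names (`Utf8Bytes`, `utf8Hex`) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Bytes {
    /** Formats the UTF-8 encoding of the given text as space-separated hex bytes. */
    static String utf8Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative byte values print as two hex digits
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F631 written as a surrogate pair so the source file encoding does not matter
        System.out.println(utf8Hex("\uD83D\uDE31")); // prints F0 9F 98 B1
    }
}
```

These are the byte values passed to CHAR(...) in the T-SQL example above.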

You mentioned in a comment on @Shnugo's answer:

I've had no problems inserting utf-8 encoded streams with utf-8 header into SQL Server 2013 NVARCHAR column. Would there be a hidden problem?

No, you didn't store UTF-8 encoded anything in an NVARCHAR column (besides, there is no 2013 version of SQL Server, but that is probably just a typo). NVARCHAR is only ever UTF-16 Little Endian. Most likely your UTF-8 stream got converted into UTF-16 LE by the database driver during transit into SQL Server. This is the same encoding that an XML column would use, but the XML column would have tried to convert the stream from UTF-8 into UTF-16 and failed because it was already UTF-16. This also means that on the way out of SQL Server, the XML document stored in the NVARCHAR column would still have the XML declaration stating that the encoding is UTF-8, but it's definitely not UTF-8.
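A rough sketch of what such a conversion amounts to, assuming the driver decodes the stream into a Java String and sends it as UTF-16 LE (the actual driver internals are not shown in this thread; `DriverReencoding` and `utf8ToUtf16Le` are hypothetical names):

```java
import java.nio.charset.StandardCharsets;

final class DriverReencoding {
    /** Decodes UTF-8 bytes into a String (UTF-16 in memory) and re-encodes it as UTF-16 LE. */
    static byte[] utf8ToUtf16Le(byte[] utf8) {
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root/>";
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
        byte[] utf16le = utf8ToUtf16Le(utf8);

        // Every ASCII character now occupies two bytes
        System.out.println(utf8.length + " UTF-8 bytes, " + utf16le.length + " UTF-16 LE bytes");

        // The text itself still claims encoding="UTF-8"
        String stored = new String(utf16le, StandardCharsets.UTF_16LE);
        System.out.println(stored.contains("encoding=\"UTF-8\""));
    }
}
```

The declaration survives as text, while the bytes underneath are no longer UTF-8.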

If you absolutely need the data to be UTF-8 on the way out because you don't want to convert the UTF-16 LE coming out of SQL Server XML or NVARCHAR into UTF-8, then you have no choice but to store the data as VARBINARY(MAX).
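A minimal sketch of the VARBINARY(MAX) route from plain JDBC. The table `dbo.XmlDocs` and column `Doc` are hypothetical, and `insert` needs a live connection, so only the byte round trip is exercised in `main`:

```java
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class Utf8BinaryStore {
    /** Encodes the XML text as the exact UTF-8 bytes that will be stored. */
    static byte[] toUtf8(String xml) {
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes bytes read back from the binary column. */
    static String fromUtf8(byte[] stored) {
        return new String(stored, StandardCharsets.UTF_8);
    }

    /** Assumes a hypothetical table: CREATE TABLE dbo.XmlDocs (Doc VARBINARY(MAX)). */
    static void insert(Connection conn, String xml) throws SQLException {
        String sql = "INSERT INTO dbo.XmlDocs (Doc) VALUES (?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            // Bytes are passed through as-is, with no character set conversion
            ps.setBytes(1, toUtf8(xml));
            ps.executeUpdate();
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root>\uD83D\uDE31</root>";
        byte[] stored = toUtf8(xml);
        System.out.println(fromUtf8(stored).equals(xml)); // round trip preserves the text
    }
}
```

The trade-off is that the column is opaque to SQL Server: none of the XML methods can be used on it without converting first.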

Solomon Rutzky answered Nov 07 '22 18:11


As you correctly found out, XML is stored as Unicode (UTF-16; well, it's actually UCS-2). There is no other format.

Within SQL Server there is VARCHAR(MAX) for extended ASCII (1-byte) and NVARCHAR(MAX) for Unicode. Both can be cast to XML directly (as long as the string is valid XML). Be aware that VARCHAR(MAX) might not be able to deal with special characters. If this is an issue, you should stick with Unicode anyway.

The problem occurs when an encoding declaration is included within <?xml ...?>:

This works:

DECLARE @xml XML =
'<?xml version="1.0" encoding="utf-8"?>
 <root>test</root>';

SELECT @xml;

This produces an error:

DECLARE @xml XML =
'<?xml version="1.0" encoding="utf-16"?>
 <root>test</root>';

SELECT @xml;

But this works again (see the leading N before the string literal):

DECLARE @xml XML =
N'<?xml version="1.0" encoding="utf-16"?>
 <root>test</root>';

SELECT @xml;

##Conclusion

If you pass the string 1-byte encoded but declared as utf-16 (or vice versa), you'll run into trouble. It is best to pass your XML without the <?xml ...?> declaration.
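One way to drop the declaration on the Java side before the insert is a small helper like the one below. It is a sketch only (`XmlDeclarationStripper` and `stripDeclaration` are invented names), and it assumes the declaration, if present, is the first thing in the string. With the JAXB marshaller from the question, setting `Marshaller.JAXB_FRAGMENT` to `Boolean.TRUE` should suppress the declaration at the source instead.

```java
import java.util.regex.Pattern;

final class XmlDeclarationStripper {
    // Matches a leading <?xml ...?> declaration plus surrounding whitespace.
    // The \s after "xml" keeps other processing instructions such as <?xml-stylesheet ...?> intact.
    private static final Pattern DECLARATION =
            Pattern.compile("^\\s*<\\?xml\\s[^>]*\\?>\\s*");

    /** Returns the document without its XML declaration, or unchanged if it has none. */
    static String stripDeclaration(String xml) {
        return DECLARATION.matcher(xml).replaceFirst("");
    }

    public static void main(String[] args) {
        String withDecl = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>\n<root>test</root>";
        System.out.println(stripDeclaration(withDecl)); // prints <root>test</root>
    }
}
```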

##UPDATE

You are mixing two things

##Encoding

From your comment:

UTF-8 is flexi-length unicode, that varies from 1 byte to 4 bytes in length. Whereas, UTF-16 is fixed length 2 byte unicode. UTF-8 seems the defacto unicode std now...

Yes, it's correct that UTF-8 and UTF-16 are two flavours of Unicode. But it is not correct to call UTF-8 the new de facto standard. This depends heavily on your needs. If you live in an English-speaking country and deal with plain Latin text, UTF-8 will save some bytes. If you live somewhere in the Far East, UTF-8 will bloat your text considerably, due to the many 3- and 4-byte codes.
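The size difference is easy to measure. A quick sketch (the class name `EncodedSizes` and the sample strings are chosen here only for illustration):

```java
import java.nio.charset.StandardCharsets;

final class EncodedSizes {
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Bytes(String s) {
        // UTF_16LE is used so that no byte order mark is added to the count
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        String latin = "hello";
        String cjk = "\u65E5\u672C\u8A9E"; // three CJK characters
        System.out.println("latin: utf-8=" + utf8Bytes(latin) + " utf-16=" + utf16Bytes(latin)); // 5 vs 10
        System.out.println("cjk:   utf-8=" + utf8Bytes(cjk) + " utf-16=" + utf16Bytes(cjk));     // 9 vs 6
    }
}
```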

And, more importantly in terms of databases, a fixed width is enormously easier to handle. Just imagine a WHERE SUBSTRING(SomeUTF8Column,100,1)='A'. With UTF-16 the engine can cut bytes 200 and 201 without looking (as long as no surrogate pairs are involved); with UTF-8 the full string up to character 100 must be analysed to find out where the 100th character actually sits. I would prefer UTF-8 only in cases where bandwidth or storage space is an important factor... For VARCHAR, SQL Server actually uses a fixed-width 1-byte encoding and not UTF-8: extended ASCII in combination with a collation.
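To make the difference concrete, here is a sketch of both lookups with 0-based character indexes. The names are hypothetical, and the fixed-width variant assumes every character is a single 2-byte unit (no surrogate pairs):

```java
import java.nio.charset.StandardCharsets;

final class CharOffsets {
    /** Fixed 2-byte width: the byte offset is plain arithmetic. */
    static int fixedWidthOffset(int charIndex) {
        return 2 * charIndex;
    }

    /** UTF-8: every byte before the target character has to be inspected. */
    static int utf8Offset(byte[] utf8, int charIndex) {
        int seen = 0;
        for (int i = 0; i < utf8.length; i++) {
            // Continuation bytes look like 10xxxxxx; anything else starts a new character
            boolean startsCharacter = (utf8[i] & 0xC0) != 0x80;
            if (startsCharacter) {
                if (seen == charIndex) {
                    return i;
                }
                seen++;
            }
        }
        throw new IndexOutOfBoundsException("character index " + charIndex);
    }

    public static void main(String[] args) {
        byte[] utf8 = "a\u00F1b".getBytes(StandardCharsets.UTF_8); // 'a', n with tilde (2 bytes), 'b'
        System.out.println(utf8Offset(utf8, 2));  // prints 3, found by scanning
        System.out.println(fixedWidthOffset(2));  // prints 4, computed directly
    }
}
```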

I've had no problems inserting utf-8 encoded streams with utf-8 header into SQL Server 2013 NVARCHAR column

And - this is even more important in terms of XML - XML is not stored as the text you see, rather as a hierarchy tree. You can store literally everything in (N)VARCHAR:

DECLARE @s VARCHAR(MAX)='Don''t store me, I''m UTF-16. Your machine will explode!';

This works with any combination. You can declare NVARCHAR and/or put an N in front of the literal. No problem due to implicit conversions.

But internally, VARCHAR cannot deal with characters outside its 1-byte code page! Try this:

 DECLARE @s NVARCHAR(MAX)=N'слов в тексте';
 SELECT @s

This will work with NVARCHAR and N'Your string' only!
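The same kind of loss can be reproduced in Java. In this sketch ISO-8859-1 merely stands in for a 1-byte Latin code page; it is an analogy, not a reproduction of SQL Server's collation handling, and the class name is invented:

```java
import java.nio.charset.StandardCharsets;

final class OneByteLoss {
    /** Encodes to a 1-byte Latin charset and decodes again; unmappable characters become '?'. */
    static String throughLatin1(String text) {
        byte[] oneByte = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(oneByte, StandardCharsets.ISO_8859_1);
    }

    /** Encodes to UTF-16 LE and decodes again; nothing is lost. */
    static String throughUtf16(String text) {
        byte[] twoByte = text.getBytes(StandardCharsets.UTF_16LE);
        return new String(twoByte, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        String cyrillic = "\u0441\u043B\u043E\u0432"; // the first word of the sample string above
        System.out.println(throughLatin1(cyrillic));                  // prints ????
        System.out.println(throughUtf16(cyrillic).equals(cyrillic));  // prints true
    }
}
```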

##XML-storage

As said before, XML is not stored as the text you see, but as a tree. Everything is optimized for performance, hence the fixed-width UTF-16. The XML declaration is omitted in any case...

The problem occurs, when you pass in a string which is physically encoded as utf-8 but declared as something else (or vice versa). You can pass in a real UTF-16 with a declared encoding of utf-16 (same with utf-8) without problems.

##Conclusion

If you have the slightest chance to include 3 or 4 byte UTF-8 codes you should stick to UTF-16.

Shnugo answered Nov 07 '22 19:11