
Convert UTF-8 varbinary(max) to varchar(max)

I have a varbinary(max) column with UTF-8-encoded text that has been compressed. I would like to decompress this data and work with it in T-SQL as a varchar(max) using the UTF-8 capabilities of SQL Server.

I'm looking for a way of specifying the encoding when converting from varbinary(max) to varchar(max). The only way I've managed to do that is by creating a table variable with a column with a UTF-8 collation and inserting the varbinary data into it.

DECLARE @rv TABLE(
    Res varchar(max) COLLATE Latin1_General_100_CI_AS_SC_UTF8 
)

INSERT INTO @rv
SELECT SUBSTRING(Decompressed, 4, DATALENGTH(Decompressed) - 3) WithoutBOM
FROM
    (SELECT DECOMPRESS(RawResource) AS Decompressed FROM Resource) t

I'm wondering if there is a more elegant and efficient approach that does not involve inserting into a table variable.

UPDATE:

Boiling this down to a simple example that doesn't deal with byte order marks or compression:

I have the string "Hello 😊", UTF-8 encoded without a BOM, stored in the variable @utf8Binary:

DECLARE @utf8Binary varbinary(max) = 0x48656C6C6F20F09F988A

Now I try to assign that into various char-based variables and print the result:

DECLARE @brokenVarChar varchar(max) = CONVERT(varchar(max), @utf8Binary)
print '@brokenVarChar = ' + @brokenVarChar

DECLARE @brokenNVarChar nvarchar(max) = CONVERT(varchar(max), @utf8Binary)
print '@brokenNVarChar = ' +  @brokenNVarChar 

DECLARE @rv TABLE(
    Res varchar(max) COLLATE Latin1_General_100_CI_AS_SC_UTF8 
)

INSERT INTO @rv
select @utf8Binary

DECLARE @working nvarchar(max)
Select TOP 1 @working = Res from @rv

print '@working = ' + @working

The results of this are:

@brokenVarChar = Hello ðŸ˜Š
@brokenNVarChar = Hello ðŸ˜Š
@working = Hello 😊

So I am able to get the binary result properly decoded using this indirect method, but I am wondering if there is a more straightforward (and likely efficient) approach.

John Stairs asked Oct 14 '20 13:10


3 Answers

There is an undocumented hack:

DECLARE @utf8 VARBINARY(MAX)=0x48656C6C6F20F09F988A;

SELECT CAST(CONCAT('<?xml version="1.0" encoding="UTF-8" ?><![CDATA[',@utf8,']]>') AS XML)
       .value('.','nvarchar(max)');

The result

Hello 😊

This works even in versions that lack the new UTF-8 collations.

UPDATE: calling this as a function

This can easily be wrapped in a scalar function:

CREATE FUNCTION dbo.Convert_UTF8_Binary_To_NVarchar(@utfBinary VARBINARY(MAX))
RETURNS NVARCHAR(MAX)
AS
BEGIN
    RETURN
    (
    SELECT CAST(CONCAT('<?xml version="1.0" encoding="UTF-8" ?><![CDATA[',@utfBinary,']]>') AS XML)
           .value('.','nvarchar(max)')
    );
END
GO

Or like this, as an inline table-valued function (both versions share the same name, so create only one of them or rename one):

CREATE FUNCTION dbo.Convert_UTF8_Binary_To_NVarchar(@utfBinary VARBINARY(MAX))
RETURNS TABLE
AS
    RETURN
    SELECT CAST(CONCAT('<?xml version="1.0" encoding="UTF-8" ?><![CDATA[',@utfBinary,']]>') AS XML)
           .value('.','nvarchar(max)') AS ConvertedString
GO

This can be used after FROM or, more appropriately, with APPLY.
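As a rough sketch of the APPLY usage, assuming the inline table-valued version of dbo.Convert_UTF8_Binary_To_NVarchar is the one that exists in the database. The @samples table variable and its column names are made up for illustration and are not part of the original answer:

```sql
-- Hypothetical sample data: two UTF-8 byte sequences without a BOM.
DECLARE @samples TABLE (Id int PRIMARY KEY, Utf8Bytes varbinary(max));

INSERT INTO @samples (Id, Utf8Bytes)
VALUES (1, 0x48656C6C6F20F09F988A),  -- bytes of "Hello 😊"
       (2, 0x4772C3BCC39F65);        -- bytes of "Grüße"

-- CROSS APPLY evaluates the inline function for each outer row and exposes
-- its ConvertedString column next to the columns of @samples.
SELECT s.Id, c.ConvertedString
FROM @samples AS s
CROSS APPLY dbo.Convert_UTF8_Binary_To_NVarchar(s.Utf8Bytes) AS c;
```

Inline table-valued functions are expanded into the calling query, so this form often behaves better over many rows than calling the scalar version once per row.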

Shnugo answered Sep 20 '22 15:09


I don't like this solution, but it's the one I arrived at (I initially thought it wasn't working, due to what appears to be a bug in ADS). One method is to create a new database with a UTF-8 collation and then pass the value to a function in that database. Because that database uses a UTF-8 collation, its default collation differs from that of the calling database, and the correct result is returned:

CREATE DATABASE UTF8 COLLATE Latin1_General_100_CI_AS_SC_UTF8;
GO
USE UTF8;
GO
CREATE OR ALTER FUNCTION dbo.Bin2UTF8 (@utfbinary varbinary(MAX))
RETURNS varchar(MAX) AS
BEGIN
    RETURN CAST(@utfbinary AS varchar(MAX));
END
GO
USE YourDatabase;
GO
SELECT UTF8.dbo.Bin2UTF8(0x48656C6C6F20F09F988A);

This, however, isn't particularly "pretty".

Larnu answered Sep 20 '22 15:09


-- The _UTF8 collation requires SQL Server 2019 or later.
DECLARE @utf8Binary varbinary(max) = 0x48656C6C6F20F09F988A;
DECLARE @working nvarchar(max) = CONCAT(@utf8Binary, '' COLLATE Latin1_General_100_CI_AS_SC_UTF8);
PRINT '@working = ' + @working;
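To tie this back to the compressed column in the question, here is a minimal sketch that decompresses, strips a UTF-8 BOM only if one is present, and then applies the CONCAT approach above. It assumes that approach behaves as this answer shows. The @compressed variable is a hypothetical stand-in for the RawResource column, built with COMPRESS so the snippet runs on its own:

```sql
-- Stand-in for one RawResource value: BOM (0xEFBBBF) + "Hello 😊", compressed.
DECLARE @compressed varbinary(max) = COMPRESS(0xEFBBBF48656C6C6F20F09F988A);

-- Step 1: decompress back to the raw UTF-8 bytes.
DECLARE @raw varbinary(max) = DECOMPRESS(@compressed);

-- Step 2: drop the three BOM bytes only when they are actually there.
DECLARE @payload varbinary(max) =
    CASE WHEN SUBSTRING(@raw, 1, 3) = 0xEFBBBF
         THEN SUBSTRING(@raw, 4, DATALENGTH(@raw) - 3)
         ELSE @raw
    END;

-- Step 3: decode as UTF-8 via CONCAT with a UTF-8 collated empty string
-- (needs SQL Server 2019 or later for the _UTF8 collation).
DECLARE @text nvarchar(max) =
    CONCAT(@payload, '' COLLATE Latin1_General_100_CI_AS_SC_UTF8);

PRINT @text;
```

Checking for the BOM before stripping it keeps the same code usable for rows that were stored without one, unlike an unconditional SUBSTRING(..., 4, ...).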

lptr answered Sep 22 '22 15:09