Is there a simple way to convert a UTF-8 encoded varbinary(max) column to varchar(max) in T-SQL? Something like CONVERT(varchar(max), [MyDataColumn]). Ideally the solution would not need custom functions.
Currently, I convert the data on the client side, but the downside is that filtering and sorting are not as efficient as they would be server-side.
Good catch regarding the MAX type, Solomon! So, since UTF-8 covers a broader range of characters than a varchar allows for, the only safe way would be to remove the encoding portion and use nvarchar, as far as I can see...
Set the [data] column to use a UTF-8 collation (new in SQL Server 2019, so not an option for you).
Set the [data] column to be NVARCHAR, and remove the encoding attribute of the <xml> tag, or the entire <xml> tag.
Convert the incoming string into UTF-8 bytes. So the ó character is two bytes in UTF-8: 0xC3B3, which appear as Ã³ in Windows-1252.
SQL Server does not know UTF-8 (at least not in any version you can use productively). There is limited support starting with v2014 SP2 (and some details about the supported versions) when reading a UTF-8 encoded file from disk via BCP (the same applies to writing content to disk).
Important to know:
VARCHAR(x) is not UTF-8. It is 1-byte-encoded extended ASCII, using a code page (living in the collation) as its character map.
NVARCHAR(x) is not UTF-16 (but very close to it; it's UCS-2). This is a 2-byte-encoded string covering almost all known characters (but exceptions exist).
UTF-8 uses 1 byte for plain Latin characters, but 2 or more bytes to encode other characters.
A VARBINARY(x) will hold the UTF-8 as a meaningless chain of bytes.
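To make the storage difference concrete, here is a small sketch that looks at the raw bytes of the same character in each representation. It assumes SQL Server 2019 or newer (needed only for the _UTF8 collation, which a later answer relies on); the variable name and the two Latin1 collations are chosen purely for illustration.

```sql
-- Illustration only; requires SQL Server 2019+ because of the _UTF8 collation.
DECLARE @Sample NVARCHAR(10) = N'ó';  -- U+00F3

SELECT
    -- NVARCHAR: one 2-byte code unit, little-endian (0xF300)
    CAST(@Sample AS VARBINARY(10)) AS NvarcharBytes,
    -- VARCHAR on code page 1252: a single byte (0xF3)
    CAST(CAST(@Sample COLLATE Latin1_General_100_CI_AS AS VARCHAR(10)) AS VARBINARY(10)) AS Cp1252Bytes,
    -- VARCHAR with a UTF-8 collation: two bytes (0xC3B3)
    CAST(CAST(@Sample COLLATE Latin1_General_100_CI_AS_SC_UTF8 AS VARCHAR(10)) AS VARBINARY(10)) AS Utf8Bytes;
```

The explicit COLLATE before the inner CAST is what selects the code page of the resulting VARCHAR, so the result does not depend on the database default.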
A simple CAST or CONVERT will not work: VARCHAR will take each single byte as a character. This is certainly not the result you would expect. NVARCHAR would take each chunk of 2 bytes as one character. Again, not what you need.
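A quick way to see both failure modes is to cast the two UTF-8 bytes of ó directly. This is only a sketch: it assumes the database default collation uses code page 1252 and is not a UTF-8 collation, and the variable name is made up for the example.

```sql
-- 0xC3B3 is the UTF-8 encoding of 'ó' (U+00F3)
DECLARE @Utf8Bytes VARBINARY(10) = 0xC3B3;

SELECT
    -- one character per byte: 'Ã³' when the default code page is 1252
    CAST(@Utf8Bytes AS VARCHAR(10))  AS AsVarchar,
    -- both bytes are read as a single UTF-16 code unit (0xB3C3), an unrelated character
    CAST(@Utf8Bytes AS NVARCHAR(10)) AS AsNvarchar;
```

Neither column comes back as ó, which is why a plain CAST is not enough here.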
You might try to write this out to a file and read it back with BCP (v2014 SP2 or higher). But the better chance I see for you is a CLR function.
The following solution should work for any encoding.
XML trick: there is a tricky way of doing exactly what the OP asks. Edit: I found the same method discussed on SO (SQL - UTF-8 to varchar/nvarchar Encoding issue).
The process goes like this:
SELECT
    CAST(
        '<?xml version=''1.0'' encoding=''utf-8''?><![CDATA[' --start CDATA
        + REPLACE(
            LB.LongBinary,
            ']]>', --we only need to escape ]]>, which ends a CDATA section
            ']]]]><![CDATA[>' --we simply split it into two CDATA sections
        ) + ']]>' AS XML --finish CDATA
    ).value('.', 'nvarchar(max)')
FROM
    #longBinaries AS LB; --any table with a varbinary(max) column; this one is created in the test script
Why it works: varbinary and varchar are the same string of bits; only the interpretation differs. The resulting XML therefore truly is a UTF-8 encoded bitstream, and the XML parser is then able to reconstruct the correct UTF-8 encoded characters.
BEWARE the 'nvarchar(max)' in the value function. If you used varchar, it would destroy multi-byte characters (depending on your collation).
BEWARE 2: XML cannot handle some characters, e.g. 0x2. When your string contains such characters, this trick will fail.
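For trying the XML trick without any table, the same expression can be applied to a variable. This is a minimal sketch, assuming a database whose default collation is not a UTF-8 collation; @Utf8Bytes is a hypothetical name and the sample bytes are simply the UTF-8 encoding of a short word.

```sql
-- 0x636166C3A9 is the UTF-8 encoding of 'café'
DECLARE @Utf8Bytes VARBINARY(MAX) = 0x636166C3A9;

SELECT
    CAST(
        '<?xml version=''1.0'' encoding=''utf-8''?><![CDATA['
        + REPLACE(@Utf8Bytes, ']]>', ']]]]><![CDATA[>')  -- split any ]]> across two CDATA sections
        + ']]>' AS XML
    ).value('.', 'nvarchar(max)') AS DecodedText;
```

The result is read as nvarchar(max) for the reason given in the first warning above.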
This is simple. Create another database with a UTF8 collation as the default one. Create a function that converts VARBINARY to VARCHAR. The returned VARCHAR will have the UTF8 collation of that database.
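A minimal sketch of that setup, assuming SQL Server 2019 or newer. The names FUNCTIONS_ONLY and dbo.VarBinaryToUTF8 are the ones the test script further down calls; the function body is an assumption, since the answer does not show it. GO is the batch separator understood by SSMS and sqlcmd.

```sql
-- Hypothetical reconstruction; only the database and function names appear in the answer.
CREATE DATABASE FUNCTIONS_ONLY COLLATE Czech_100_CI_AI_SC_UTF8;
GO
USE FUNCTIONS_ONLY;
GO
CREATE FUNCTION dbo.VarBinaryToUTF8 (@Utf8Bytes VARBINARY(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
    -- Inside this database, VARCHAR values take the UTF8 default collation,
    -- so the bytes are reinterpreted as UTF-8 text rather than converted.
    RETURN CAST(@Utf8Bytes AS VARCHAR(MAX));
END;
GO
```

From another database the function would then be called with its three-part name, for example FUNCTIONS_ONLY.dbo.VarBinaryToUTF8([MyDataColumn]).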
Plain insert trick: this is another simple trick. Create a table with one column VARCHAR COLLATE ...UTF8. Insert the VARBINARY data into this table. It will get saved correctly as UTF8 VARCHAR. It is sad that memory-optimized tables cannot use UTF8 collations...
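A short sketch of the insert approach, assuming SQL Server 2019 or newer. The table name #Utf8Text and the sample bytes are made up for the example; the collation is the one used elsewhere in this answer.

```sql
DROP TABLE IF EXISTS #Utf8Text;
CREATE TABLE #Utf8Text (Txt VARCHAR(MAX) COLLATE Czech_100_CI_AI_SC_UTF8 NOT NULL);

-- 0x636166C3A9 is the UTF-8 encoding of 'café'.
-- The VARBINARY value is implicitly converted to the column type on insert.
INSERT INTO #Utf8Text (Txt)
SELECT CAST(0x636166C3A9 AS VARBINARY(MAX));

-- Read it back as UTF8 VARCHAR and, if needed, as NVARCHAR.
SELECT Txt, CAST(Txt AS NVARCHAR(MAX)) AS TxtAsNvarchar
FROM #Utf8Text;
```

Once the text sits in a UTF8-collated column, filtering and sorting can be done server-side like on any other VARCHAR column.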
Alter table trick (don't use this; it is unnecessary, see the Plain insert trick):
I was trying to come up with an approach using SQL Server 2019's UTF8 collation, and I have found one possible method so far that should be faster than the XML trick.
drop table if exists #bin, #utf8;
create table #utf8 (UTF8 VARCHAR(MAX) COLLATE Czech_100_CI_AI_SC_UTF8);
create table #bin (BIN VARBINARY(MAX));
insert into #utf8 (UTF8) values ('Žluťoučký kůň říčně pěl ďábelské ódy za svitu měsíce.');
insert into #bin (BIN) select CAST(UTF8 AS varbinary(max)) from #utf8;
select * from #utf8; --here you can see the utf8 string is stored correctly and that
select BIN, CAST(BIN AS VARCHAR(MAX)) from #bin; --utf8 binary is converted into gibberish
alter table #bin alter column BIN varchar(max) collate Czech_100_CI_AI_SC_UTF8;
select * from #bin; --voilà, correctly converted varchar
alter table #bin alter column BIN nvarchar(max);
select * from #bin; --finally, correctly converted nvarchar
The test:
@TextLengthMultiplier determines the length of the converted text.
@TextAmount determines how many texts are converted at once.
------------------
--TEST SETUP
--pick one of the two sample texts; the script needs @LongText to be declared
DECLARE @LongText NVARCHAR(MAX) = N'český jazyk, Tiếng Việt, русский язык, 漢語, ]]>';
--DECLARE @LongText NVARCHAR(MAX) = N'JUST ASCII, for LOLZ------------------------------------------------------';
DECLARE
@TextLengthMultiplier INTEGER = 100000,
@TextAmount INTEGER = 10;
---------------------
-- TECHNICALITIES
DECLARE
@StartCDATA DATETIME2(7), @EndCDATA DATETIME2(7),
@StartTable DATETIME2(7), @EndTable DATETIME2(7),
@StartDB DATETIME2(7), @EndDB DATETIME2(7),
@StartInsert DATETIME2(7), @EndInsert DATETIME2(7);
drop table if exists #longTexts, #longBinaries, #CDATAConverts, #DBConverts, #InsertConverts;
CREATE TABLE #longTexts (LongText VARCHAR (MAX) COLLATE Czech_100_CI_AI_SC_UTF8 NOT NULL);
CREATE TABLE #longBinaries (LongBinary VARBINARY(MAX) NOT NULL);
CREATE TABLE #CDATAConverts (LongText VARCHAR (MAX) COLLATE Czech_100_CI_AI_SC_UTF8 NOT NULL);
CREATE TABLE #DBConverts (LongText VARCHAR (MAX) COLLATE Czech_100_CI_AI_SC_UTF8 NOT NULL);
CREATE TABLE #InsertConverts (LongText VARCHAR (MAX) COLLATE Czech_100_CI_AI_SC_UTF8 NOT NULL);
insert into #longTexts --make the long text longer
(LongText)
select
REPLICATE(@LongText, @TextLengthMultiplier)
from
TESTES.dbo.Numbers --use while if you don't have number table
WHERE
Number BETWEEN 1 AND @TextAmount; --make more of them
insert into #longBinaries (LongBinary) select CAST(LongText AS varbinary(max)) from #longTexts;
--sanity check...
SELECT TOP(1) * FROM #longTexts;
------------------------------
--MEASURE CDATA--
SET @StartCDATA = SYSDATETIME();
INSERT INTO #CDATAConverts
(
LongText
)
SELECT
CAST(
'<?xml version=''1.0'' encoding=''utf-8''?><![CDATA['
+ REPLACE(
LB.LongBinary,
']]>',
']]]]><![CDATA[>'
) + ']]>' AS XML
).value('.', 'nvarchar(max)')
FROM
#longBinaries AS LB;
SET @EndCDATA = SYSDATETIME();
--------------------------------------------
--MEASURE ALTER TABLE--
SET @StartTable = SYSDATETIME();
DROP TABLE IF EXISTS #AlterConverts;
CREATE TABLE #AlterConverts (UTF8 VARBINARY(MAX));
INSERT INTO #AlterConverts
(
UTF8
)
SELECT
LB.LongBinary
FROM
#longBinaries AS LB;
ALTER TABLE #AlterConverts ALTER COLUMN UTF8 VARCHAR(MAX) COLLATE Czech_100_CI_AI_SC_UTF8;
--ALTER TABLE #AlterConverts ALTER COLUMN UTF8 NVARCHAR(MAX);
SET @EndTable = SYSDATETIME();
--------------------------------------------
--MEASURE DB--
SET @StartDB = SYSDATETIME();
INSERT INTO #DBConverts
(
LongText
)
SELECT
FUNCTIONS_ONLY.dbo.VarBinaryToUTF8(LB.LongBinary)
FROM
#longBinaries AS LB;
SET @EndDB = SYSDATETIME();
--------------------------------------------
--MEASURE Insert--
SET @StartInsert = SYSDATETIME();
INSERT INTO #InsertConverts
(
LongText
)
SELECT
LB.LongBinary
FROM
#longBinaries AS LB;
SET @EndInsert = SYSDATETIME();
--------------------------------------------
-- RESULTS
SELECT
DATEDIFF(MILLISECOND, @StartCDATA, @EndCDATA) AS CDATA_MS,
DATEDIFF(MILLISECOND, @StartTable, @EndTable) AS ALTER_MS,
DATEDIFF(MILLISECOND, @StartDB, @EndDB) AS DB_MS,
DATEDIFF(MILLISECOND, @StartInsert, @EndInsert) AS Insert_MS;
SELECT TOP(1) '#CDATAConverts ', * FROM #CDATAConverts ;
SELECT TOP(1) '#DBConverts ', * FROM #DBConverts ;
SELECT TOP(1) '#InsertConverts', * FROM #InsertConverts;
SELECT TOP(1) '#AlterConverts ', * FROM #AlterConverts ;