
How to deal with Unicode replacement character � (0xFFFD / 65533) in SQL

A week ago I was hardly aware that the Unicode replacement character (�) existed. Now I'm learning that there seems to be some very special and strange logic surrounding it, at least in SQL. For example:

select replace(N'bl' + NCHAR(65533) + N'rt', NCHAR(65533), N'X')

returns bl�rt instead of blXrt. And:

select CHARINDEX(NCHAR(65533), N'b' + NCHAR(65533) + N't')

returns 0 instead of 2. I'm just trying to determine which strings in a table contain this character, and I can't find a straightforward way to do it. The treatment of this character is strange enough that there must be more I can learn about it. Where is the behavior defined, and more specifically, what is the easiest way to locate strings in an MS SQL Server database that contain this character?

EDIT: For anyone experimenting with answers, I suggest testing your answer on the following data:

create table Test([Value] nvarchar(100) not null)
insert into Test([Value]) values('b' + NCHAR(65533) + 't')
insert into Test([Value]) values('b?t')
insert into Test([Value]) values('bat')
BlueMonkMN asked May 14 '15 14:05




1 Answer

Krzysztof Kozielczyk wrote that the Unicode replacement character has to be compared under a binary collation before it can be replaced, so that may be the answer to your initial question.

SELECT REPLACE(
    N'test' + NCHAR(65533) COLLATE Latin1_General_BIN,
    NCHAR(65533) COLLATE Latin1_General_BIN,
    N'')

The above code also points to how to locate strings that contain this character, but it's a workaround rather than a solution.
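To make the "locate" part concrete, here is a minimal sketch that applies the same binary-collation trick to the Test table from the question. It assumes that table exists as defined there; the choice of Latin1_General_BIN simply mirrors the REPLACE example, and I have not checked other binary collations.

```sql
-- Assumes the Test table and sample rows from the question.
-- A binary collation compares raw code values instead of applying
-- linguistic comparison rules, so NCHAR(65533) can be matched.
SELECT [Value]
FROM Test
WHERE CHARINDEX(NCHAR(65533), [Value] COLLATE Latin1_General_BIN) > 0;

-- The same idea expressed with LIKE.
SELECT [Value]
FROM Test
WHERE [Value] COLLATE Latin1_General_BIN LIKE N'%' + NCHAR(65533) + N'%';
```

On the sample data the intent is that only the row containing the replacement character comes back, not 'b?t' or 'bat'. Note that applying COLLATE to the column in the WHERE clause will likely prevent index seeks, so expect a scan on large tables.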

SQLHound answered Sep 24 '22 00:09