Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Defeat these dashed dashes in SQL server

I have a table that contains the names of various recording artists. One of them has a dash in their name. If I run the following:

Select artist
, substring(artist,8,1) as substring_artist
, ascii(substring(artist,8,1)) as ascii_table
, ascii('-') as ascii_dash_key /*The dash key next to zero */
, len(artist) as len_artist 
From [dbo].[mytable] where artist like 'Sleater%'

Then the following is returned. This seems to indicate that a dash (ascii 45) is being stored in the artist column

enter image description here

However, if I change the where clause to:

From [dbo].[mytable] where artist like 'Sleater' + char(45) + '%'

I get no results returned. If I copy and paste the output from the artist column into a hex editor, I can see that the dash is actually stored as E2 80 90, the Unicode byte sequence for the multi-byte hyphen character.

enter image description here

So, I'd like to find and replace such occurrences with a standard ascii hyphen, but I'm am at a loss as to what criteria to use to find these E2 80 90 hyphens?

like image 953
basinbasin Avatar asked Aug 09 '18 08:08

basinbasin


2 Answers

Your char is the hyphen, information on it here :

https://www.charbase.com/2010-unicode-hyphen

You can see that the UTF16 code is 2010 so in T-SQL you can build it with

SELECT NCHAR(2010) 

From there you can use any SQL command with that car, for example in a select like :

Select artist
From [dbo].[mytable] where artist like N'Sleater' + NCHAR(2010) + '%'

or as you want in a

REPLACE( artist, NCHAR(2010), '-' )

with a "real" dash

EDIT:

If the collation of your DB give you some trouble with the NCHAR(2010) you can also try to use the car N'‐' that you'll copy/paste from the charbase link I gave you so :

REPLACE( artist , N'‐' , '-'  ) 

that you can even take from the string here (made with the special car) so all made for you :

update mytable set artist=REPLACE( artist, N'‐' , '-' )
like image 189
Cali Avatar answered Oct 30 '22 15:10

Cali


I don't know your table definition and COLLATION but I'm almost sure that you are mixing NCHAR and CHAR types and convert unicode, multibyte characters to sinle byte representations. Take a look at this demo:

WITH Demo AS
(
  SELECT N'ABC'+NCHAR(0x2010)+N'DEF' T
)
SELECT
  T,
  CASE WHEN T LIKE 'ABC'+CHAR(45)+'%' THEN 1 ELSE 0 END [Char],
  CASE WHEN T LIKE 'ABC-%' THEN 1 ELSE 0 END [Hyphen],
  CASE WHEN T LIKE N'ABC‐%' THEN 1 ELSE 0 END [Unicode-Hyphen],--unicode hyphen us used here
  CASE WHEN T LIKE N'ABC'+NCHAR(45)+N'%' THEN 1 ELSE 0 END [NChar],
  CASE WHEN CAST(T AS varchar(MAX)) LIKE 'ABC-%' THEN 1 ELSE 0 END [ConvertedToAscii],
  ASCII(NCHAR(0x2010)) ConvertedToAscii,
  CAST(SUBSTRING(T, 4, 1) AS varbinary) VarbinaryRepresentation
FROM Demo

My results:

T       Char        Hyphen      Unicode-Hyphen NChar       ConvertedToAscii ConvertedToAscii VarbinaryRepresentation
------- ----------- ----------- -------------- ----------- ---------------- ---------------- --------------------------------------------------------------
ABC‐DEF 0           0           1              0           1                45               0x1020

UTF-8 (3 bytes) representation is the same as 2010 in unicode.

like image 23
Paweł Dyl Avatar answered Oct 30 '22 16:10

Paweł Dyl