Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Some Unicode space category characters are not being replaced by TRANSLATE or REPLACE

I have the defined a function dbo.ufnRemoveSpaceCharacters to remove all characters under the Unicode category Zs (Space Characters). All 17 of the characters able to be replaced/translated using REPLACE and TRANSLATE when they are the only character in the string (NCHAR(1) or NVARCHAR(1)), but when the characters are combined into a string, such as a a (where the space/dash is the Ogham Space Mark 0x1680), the character is NOT replaced. The following characters are not replaced/translated Ogham Space Mark 0x1680, Narrow No Break Space 0x202F, Medium Mathematical Space 0x205F.

DROP FUNCTION IF EXISTS dbo.ufnRemoveSpaceCharacters
GO

CREATE FUNCTION dbo.ufnRemoveSpaceCharacters(@Value NVARCHAR(MAX))  
RETURNS NVARCHAR(MAX)
AS   
BEGIN  
    DECLARE @Space NCHAR = NCHAR(0x0020);
    DECLARE @NoBreakSpace NCHAR = NCHAR(0x00A0);
    DECLARE @OghamSpaceMark NCHAR = NCHAR(0x1680);
    DECLARE @EnQuad NCHAR = NCHAR(0x2000);
    DECLARE @EmQuad NCHAR = NCHAR(0x2001);
    DECLARE @EnSpace NCHAR = NCHAR(0x2002);
    DECLARE @EmSpace NCHAR = NCHAR(0x2003);
    DECLARE @ThreePerEmSpace NCHAR = NCHAR(0x2004);
    DECLARE @FourPerEmSpace NCHAR = NCHAR(0x2005);
    DECLARE @SixPerEmSpace NCHAR = NCHAR(0x2006);
    DECLARE @FigureSpace NCHAR = NCHAR(0x2007);
    DECLARE @PunctuationSpace NCHAR = NCHAR(0x2008);
    DECLARE @ThinSpace NCHAR = NCHAR(0x2009);
    DECLARE @HairSpace NCHAR = NCHAR(0x200A);
    DECLARE @NarrowNoBreakSpace NCHAR = NCHAR(0x202F);
    DECLARE @MediumMathematicalSpace NCHAR = NCHAR(0x205F);
    DECLARE @IdeographicSpace NCHAR = NCHAR(0x3000);

    DECLARE @SpaceCharacters NVARCHAR(17) =
        @Space + @NoBreakSpace + @OghamSpaceMark + @EnQuad 
      + @EmQuad + @EnSpace + @EmSpace + @ThreePerEmSpace 
      + @FourPerEmSpace + @SixPerEmSpace + @FigureSpace 
      + @PunctuationSpace + @ThinSpace + @HairSpace 
      + @NarrowNoBreakSpace + @MediumMathematicalSpace + @IdeographicSpace;

    DECLARE @ReplacePhrase NVARCHAR(17) = N'aaaaaaaaaaaaaaaaa';

    RETURN REPLACE(TRANSLATE(@Value, @SpaceCharacters, @ReplacePhrase), @Space,  '')
END
GO

You can reproduce this by using the following

DECLARE @Space NCHAR = NCHAR(0x0020);
DECLARE @NoBreakSpace NCHAR = NCHAR(0x00A0);
DECLARE @OghamSpaceMark NCHAR = NCHAR(0x1680);
DECLARE @EnQuad NCHAR = NCHAR(0x2000);
DECLARE @EmQuad NCHAR = NCHAR(0x2001);
DECLARE @EnSpace NCHAR = NCHAR(0x2002);
DECLARE @EmSpace NCHAR = NCHAR(0x2003);
DECLARE @ThreePerEmSpace NCHAR = NCHAR(0x2004);
DECLARE @FourPerEmSpace NCHAR = NCHAR(0x2005);
DECLARE @SixPerEmSpace NCHAR = NCHAR(0x2006);
DECLARE @FigureSpace NCHAR = NCHAR(0x2007);
DECLARE @PunctuationSpace NCHAR = NCHAR(0x2008);
DECLARE @ThinSpace NCHAR = NCHAR(0x2009);
DECLARE @HairSpace NCHAR = NCHAR(0x200A);
DECLARE @NarrowNoBreakSpace NCHAR = NCHAR(0x202F);
DECLARE @MediumMathematicalSpace NCHAR = NCHAR(0x205F);
DECLARE @IdeographicSpace NCHAR = NCHAR(0x3000);

DECLARE @SpaceCharacters NVARCHAR(17) =
    @Space + @NoBreakSpace + @OghamSpaceMark + @EnQuad 
    + @EmQuad + @EnSpace + @EmSpace + @ThreePerEmSpace 
    + @FourPerEmSpace + @SixPerEmSpace + @FigureSpace 
    + @PunctuationSpace + @ThinSpace + @HairSpace 
    + @NarrowNoBreakSpace + @MediumMathematicalSpace + @IdeographicSpace;

SELECT REPLACE(@OghamSpaceMark, @OghamSpaceMark, 'a'), 
       TRANSLATE(@OghamSpaceMark, @OghamSpaceMark, 'a'),
       REPLACE(@Space + @OghamSpaceMark, @OghamSpaceMark, 'a'), 
       TRANSLATE(@Space + @OghamSpaceMark + @Space, @OghamSpaceMark, 'a'),
       dbo.ufnRemoveSpaceCharacters(@SpaceCharacters),
       REPLACE(@SpaceCharacters, @OghamSpaceMark, 'a')

with output

(No column name) (No column name) (No column name) (No column name) (No column name) (No column name)
a a a aa aaaaaaaaaaa a                

I also ran the following to provide additional info

SELECT os_language_version,
       SERVERPROPERTY('LCID') AS 'Instance-LCID',
       SERVERPROPERTY('Collation') AS 'Instance-Collation',
       SERVERPROPERTY('ComparisonStyle') AS 'Instance-ComparisonStyle',
       SERVERPROPERTY('SqlSortOrder') AS 'Instance-SqlSortOrder',
       SERVERPROPERTY('SqlSortOrderName') AS 'Instance-SqlSortOrderName',
       SERVERPROPERTY('SqlCharSet') AS 'Instance-SqlCharSet',
       SERVERPROPERTY('SqlCharSetName') AS 'Instance-SqlCharSetName',
       DATABASEPROPERTYEX(N'{database_name}', 'LCID') AS 'Database-LCID',
       DATABASEPROPERTYEX(N'{database_name}', 'Collation') AS 'Database-Collation',
       DATABASEPROPERTYEX(N'{database_name}', 'ComparisonStyle') AS 'Database-ComparisonStyle',
       DATABASEPROPERTYEX(N'{database_name}', 'SQLSortOrder') AS 'Database-SQLSortOrder'
FROM sys.dm_os_windows_info;

with output

os_language_version Instance-LCID Instance-Collation Instance-ComparisonStyle Instance-SqlSortOrder Instance-SqlSortOrderName Instance-SqlCharSet Instance-SqlCharSetName Database-LCID Database-Collation Database-ComparisonStyle Database-SQLSortOrder
1033 1033 SQL_Latin1_General_CP1_CI_AS 196609 52 nocase_iso 1 iso_1 NULL NULL NULL NULL
like image 409
Collin Stevens Avatar asked Sep 13 '25 02:09

Collin Stevens


1 Answers

@JeroenMostert Answered this with:

In the last line of your function, add COLLATE Latin1_General_BIN2 after @Value. Welcome to the wonderful wacky world of how collations treat character clusters; fortunately the binary collations just do as they're told.

(Unfortunately it's not permitted to mark parameters or output types with a COLLATE clause, as this would arguably be much easier than having to figure out exactly where it has to go in every other spot.)

like image 186
APB Reports Avatar answered Sep 15 '25 18:09

APB Reports