Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL sort order in Japanese breaks when text includes non-Japanese characters

It seems that Japanese sorting "breaks" when the text contains non-japanese text, even when forcing any possible collation after the sort part of the query.

I would like to know if this is a known phenomenon, and what a solution could be.

In the end I'm look for a kana type insensitive, case sensitive sorting, while searching should be kana type insensitive and case insensitive

Here is the test case:

I would assume from the script below, that I get the same results in both queries (the expected sort order is in the third column). Basically once I sort by the complete word, and once I sort manually by the first letter, then the second and then third letter.

Given the DB collation SQL_Latin1_General_CP1_CI_AS

declare  @temp as table  (title nvarchar(5),  expected int,  script varchar(40) )

set nocount on
INSERT INTO @temp values(N'かか7', 4,'hiragana no accent')
INSERT INTO @temp values(N'がが6',7,'hiragana with accent') 
INSERT INTO @temp values(N'いい5',1,'earlier letter hiragana no accent') 
INSERT INTO @temp values(N'カカ4',3, 'katakana no accent') 
INSERT INTO @temp values(N'ガガ3',6, 'katakana with accent') 
INSERT INTO @temp values(N'かか2',2, 'hiragana no accent') 
INSERT INTO @temp values(N'がが1', 5, 'hiragana with accent')

--BAD
select unicode(left(title,1)) 'bin', * from @temp order by title  
--GOOD
select unicode(left(title,1)) 'bin', * from @temp order by left(title,1),substring(title,2,1), substring(title,3,1)

However only the second version works, the first one doesn't sort correctly:

the two result sets

It seems it has to do with the numbers in the title field, since when I remove them, I do get the same order.

declare  @temp as table  (title nvarchar(5),  expected int,  script varchar(40) )

set nocount on
INSERT INTO @temp values(N'かか', 2,'hiragana no accent')
INSERT INTO @temp values(N'がが',3,'hiragana with accent') 
INSERT INTO @temp values(N'いい',1,'earlier letter hiragana no accent') 
INSERT INTO @temp values(N'カカ',2, 'katakana no accent') 
INSERT INTO @temp values(N'ガガ',3, 'katakana with accent') 
INSERT INTO @temp values(N'かか',2, 'hiragana no accent') 
INSERT INTO @temp values(N'がが', 3, 'hiragana with accent')

--GOOD
select unicode(left(title,1)) 'bin', * from @temp order by title  
--GOOD
select unicode(left(title,1)) 'bin', * from @temp order by left(title,1),substring(title,2,1)

See here the results:

correct sort order

Does anybody have a clue why, and possibly a solution?

like image 205
Gideon Avatar asked Dec 09 '19 13:12

Gideon


People also ask

How do you sort countries in descending order in SQL?

The SQL ORDER BY Keyword The ORDER BY keyword is used to sort the result-set in ascending or descending order. The ORDER BY keyword sorts the records in ascending order by default. To sort the records in descending order, use the DESC keyword.

How do you sort alphabets in SQL?

If you want to sort based on two columns, separate them by commas. For example, ORDER BY LAST_NAME ASC, FIRST_NAME DESC; would display results sorted alphabetically by last name. If the same LAST_NAME matches multiple FIRST_NAME entries, the results of FIRST_NAME will also display in descending order.

How do I sort Z in SQL?

To do so is easy. Just add DESC, The SQL DESC keyword specifies a descending sort order for a column within in the SQL ORDER BY clause. Values are sorted in Z to A order.


1 Answers

Brute-force approach: Checking all supported collations in SQL Server:

create table ##temp(title nvarchar(5),  expected int,  script varchar(40) );

INSERT INTO ##temp values(N'かか7', 4,'hiragana no accent');
INSERT INTO ##temp values(N'がが6',7,'hiragana with accent');
INSERT INTO ##temp values(N'いい5',1,'earlier letter hiragana no accent'); 
INSERT INTO ##temp values(N'カカ4',3, 'katakana no accent');
INSERT INTO ##temp values(N'ガガ3',6, 'katakana with accent'); 
INSERT INTO ##temp values(N'かか2',2, 'hiragana no accent');
INSERT INTO ##temp values(N'がが1', 5, 'hiragana with accent');

And script:

CREATE TABLE result(collation_name NVARCHAR(1000));
DECLARE @collate_name NVARCHAR(1000);
DECLARE @sql NVARCHAR(MAX);

DECLARE c CURSOR FOR
SELECT name FROM sys.fn_helpcollations() /* where name LIKE '%japan%'*/;

OPEN c;
FETCH NEXT FROM c INTO @collate_name;

WHILE @@FETCH_STATUS = 0  
BEGIN  
     SET @sql = REPLACE(
 N'with cte as (
  select bin = unicode(left(title,1)),expected
         ,rn= row_number() over(order by title collate <collate>)
         ,collation = ''<collate>''
   from ##temp 
)
select collation
from cte
where expected = rn GROUP BY collation HAVING COUNT(*) = 7'
     , '<collate>', @collate_name);
     -- debug
     --PRINT @sql;

     INSERT INTO result(collation_name) EXEC (@sql);
     FETCH NEXT FROM c INTO @collate_name;
END 

SELECT * FROM result;

CLOSE c; 
DEALLOCATE c;

db<>fiddle demo

Result: There is no collation in SQL Server 2017 that will match "expected order".

like image 130
Lukasz Szozda Avatar answered Oct 12 '22 23:10

Lukasz Szozda