 

How to sort non-English strings?

I did look up existing answers, and they work well for the standard alphabet, but my situation is different.

I am programming in Java. My program has, at some point, a list of string items, and I would like to sort those items alphabetically.

If I were sorting by the English alphabet, it would be easy: nearly all code pages are compatible with the American Standard Code for Information Interchange (ASCII), and they already have the letters of the English alphabet in sorted order. To sort my list, I would only have to compare the char values to determine which letter goes where.

My problem is that I do not want to sort the list using the English alphabet. My program can be displayed in English or in several other languages, and some of those languages have an alphabet that differs from the English one. Their letters are not in the same order as in the code page, so a simple < or > comparison of char values does not sort them correctly.
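To make the problem concrete, here is a minimal sketch of what comparing by raw char values does. The class name, helper method, and sample words are made up for illustration; the non-ASCII letter is written as a Unicode escape so the source file encoding does not matter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class; the name and the sample words are illustrative only.
final class NaiveSortDemo {

    // Sorts a copy with String's natural ordering, which compares UTF-16 char values.
    static List<String> sortByCharValue(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // "\u010daj" is "čaj": the letter č has char value 269, which is above 'z' (122).
        List<String> words = Arrays.asList("zebra", "\u010daj", "dan");
        // "čaj" ends up last, after "zebra", although č belongs right after c.
        System.out.println(sortByCharValue(words));
    }
}
```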

For the purposes of this question, let's say the English alphabet is as follows:

a,
b,
c,
d,
e,
f,
g.

Let's say there is a certain country named "ABC" whose alphabet goes like this:

d,
b,
g,
e,
a,
c,
f.

First of all: if a is equal to 97 in the code page, b to 98, c to 99, et cetera, how can I sort my list using the second alphabet in this example, given that its first letter is equal to 100, its second to 98, its third to 103, et cetera?

And my second question: unfortunately, some of the countries I am translating my program for have alphabets in which certain combinations of letters are treated as one letter. For my second example, let's say that country "def" has the following alphabet:

d,
g,
be,
e,
fe,
c,
f.

Here: d is the first letter of the alphabet, g the second, be the third (ONE letter: although it is written with two characters, it is considered a single letter and has its own position in the alphabet), e the fourth, fe the fifth (also written with two characters, but treated as ONE letter), c the sixth, and f the seventh.

As you can see in this imaginary example number 2, the imaginary country "def" has a really screwed up alphabet. After these two examples of the alphabets of two imaginary countries, you can understand why I cannot use the standard method for sorting strings.

So, can you please help me out with this sorting? I am not sure what I can do to sort according to such a screwed up alphabet.

Postscript: the lines below are not important for the question; they are just more info in case anyone wants to know where I found such a screwed up alphabet.

I gave those examples, which consist of 7 randomly ordered letters, just to keep this question simple. In case you wonder what my real problem is: I am trying to translate my program into Croatian. The Croatian alphabet is really screwed up, because it goes as follows:

1 |a
2 |b
3 |c
4 |č
5 |ć
6 |d
7 |dž
8 |đ
9 |e
10|f
11|g
12|h
13|i
14|j
15|k
16|l
17|lj
18|m
19|n
20|nj
21|o
22|p
23|r
24|s
25|š
26|t
27|u
28|v
29|z
30|ž

As you can see, the Croatian alphabet is somewhat similar to the English alphabet, but most of the letters are not in the same position as the English ones, several of them do not exist in the English alphabet at all, and several are single letters written with two characters. So it is really difficult to sort, and I hope someone knows a method for doing it.

Of course, there is the dumbest method, which always works and can sort anything: a switch statement. I compare two string items, and for each letter I use a switch with roughly 32 cases (one per letter plus a default), each of which contains its own switch with another 32 cases. That is about 1024 cases in total, and if an average case takes 4 lines of code, my sort method for a non-English alphabet would be at least 4096 lines long. That is a huge method.

This is the dumbest way of sorting, but it is the only one I can figure out at the moment. I am asking here in the hope that someone knows a simpler method, one that does not need 4k lines of code just to sort strings. My method for sorting English strings takes only a bit more than 10 lines of code, so I hope someone can suggest something well under 4k lines.

So if anyone knows a simpler solution, I would appreciate it.

Thanks.

SYOB SYOT asked Jan 04 '23 21:01


2 Answers

You use a Collator for that. Collators are Java's way to handle internationalized comparisons.

import java.text.Collator;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

List<String> mylist = ...;
Locale croatian = new Locale("hr", "HR");
// Pass whichever Locale you need as the argument to the getInstance method.
Collator collator = Collator.getInstance(croatian);
Collections.sort(mylist, collator);

A Locale is not just a language; it also covers many other conventions. The same language can be sorted differently depending on the country, region, or convention within the country, which is why a Locale can be built from up to three parts: "language", "country" and "variant".
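As a fuller, self-contained sketch of the same idea: the class name, the helper method, and the sample words below are assumptions added for illustration, and Locale.forLanguageTag("hr-HR") is used as another way of obtaining the Croatian locale. Non-ASCII letters are written as Unicode escapes so the file encoding does not matter.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; names and sample words are illustrative only.
final class CroatianSortDemo {

    // Returns a sorted copy of the input, ordered by the collation rules of the given locale.
    static List<String> sortForLocale(List<String> words, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> copy = new ArrayList<>(words);
        copy.sort(collator); // Collator implements Comparator<Object>
        return copy;
    }

    public static void main(String[] args) {
        Locale croatian = Locale.forLanguageTag("hr-HR");
        // "\u010daj" is "čaj" and "\u0161uma" is "šuma".
        List<String> words = Arrays.asList("zebra", "\u010daj", "dan", "\u0161uma", "banana");
        System.out.println(sortForLocale(words, croatian));
    }
}
```

With these words, "čaj" lands between "banana" and "dan", and "šuma" lands before "zebra", instead of both being pushed to the end as a plain char-value comparison would do.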

Erwin Bolwidt answered Jan 07 '23 12:01


The concept is called collation. You can look it up to learn more about it. For example, Oracle/Sun has a tutorial on it:

https://docs.oracle.com/javase/tutorial/i18n/text/rule.html
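For a fully custom alphabet such as the two imaginary ones in the question, one option is java.text.RuleBasedCollator, which builds a collator from a rule string. The following is only a sketch: the class name, constants, and helper method are made up for illustration, and it assumes the strings contain only letters that appear in the rules (other characters get no explicit position).

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; names are illustrative only.
final class CustomAlphabetDemo {

    // Alphabet of the imaginary country "ABC" from the question.
    static final String ABC_RULES = "< d < b < g < e < a < c < f";

    // Alphabet of the imaginary country "def"; "be" and "fe" each count as one letter.
    static final String DEF_RULES = "< d < g < be < e < fe < c < f";

    // Returns a sorted copy of the input, ordered by the given collation rules.
    static List<String> sortWithRules(List<String> words, String rules) throws ParseException {
        RuleBasedCollator collator = new RuleBasedCollator(rules);
        List<String> copy = new ArrayList<>(words);
        copy.sort(collator);
        return copy;
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(sortWithRules(Arrays.asList("a", "b", "c", "d", "e", "f", "g"), ABC_RULES));
        System.out.println(sortWithRules(Arrays.asList("f", "fe", "e", "be", "g", "d", "c"), DEF_RULES));
    }
}
```

Each "<" in a rule string introduces the next letter in the ordering, and a multi-character entry such as "be" is treated as a single collation unit. For a real language such as Croatian, the locale-based Collator.getInstance is usually the simpler choice.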

leeyuiwah answered Jan 07 '23 12:01