Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I define a custom alphabet order for comparing and sorting strings in go?

Please read to the bottom before marking this as duplicate

I would like to be able to sort an array of strings (or a slice of structs based on one string value) alphabetically, but based on a custom alphabet or unicode letters.

Most times people advise using a collator that supports different pre-defined locales/alphabets. (See this answer for Java), but what can be done for rare languages/alphabets that are not available in these locale bundles?

The language I would like to use is not available in the list of languages supported and usable by Golangs's collate, so I need to be able to define a custom alphabet, or order of Unicode characters/runes for sorting.

Others suggest translate the strings into an english/ASCII sortable alphabet first, and then sort that. That's what's been suggested by a similar question in this solution done in Javascript or this solution in Ruby. But surely there must be a more efficient way to do this with Go.

Is it possible to create a Collator in Go that uses a custom alphabet/character set? Is that what func NewFromTable is for?

It seems that I should be able to use the Reorder function but it looks like this is not yet implemented in the language? The source code shows this:

func Reorder(s ...string) Option {
    // TODO: need fractional weights to implement this.
    panic("TODO: implement")
}

How can I define a custom alphabet order for comparing and sorting strings in go?

like image 899
Adam D Avatar asked Dec 23 '18 09:12

Adam D


People also ask

How do you sort a string in alphabetical order?

Using the toCharArray() method Get the required string. Convert the given string to a character array using the toCharArray() method. Sort the obtained array using the sort() method of the Arrays class. Convert the sorted array to String by passing it to the constructor of the String array.

How do you sort a string in order?

The main logic is to toCharArray() method of the String class over the input string to create a character array for the input string. Now use Arrays. sort(char c[]) method to sort character array. Use the String class constructor to create a sorted string from a char array.

How do I sort a string array in go?

1. Strings: This function is used to only sorts a slice of strings and it sorts the elements of the slice in increasing order. 2. StringsAreSorted: This function is used to check whether the given slice of strings is in sorted form(in increasing order) or not.

How do I order a slice in go?

In Go language, you can sort a slice with the help of Slice() function. This function sorts the specified slice given the provided less function. The result of this function is not stable. So for stable sort, you can use SliceStable.


1 Answers

Note beforehand:

The following solution has been cleaned up and optimized, and published as a reusable library here: github.com/icza/abcsort.

Using abcsort, custom-sorting a string slice (using a custom alphabet) is as simple as:

sorter := abcsort.New("bac")

ss := []string{"abc", "bac", "cba", "CCC"}
sorter.Strings(ss)
fmt.Println(ss)

// Output: [CCC bac abc cba]

Custom-sorting a slice of structs by one of the struct field is like:

type Person struct {
    Name string
    Age  int
}

ps := []Person{{Name: "alice", Age: 21}, {Name: "bob", Age: 12}}
sorter.Slice(ps, func(i int) string { return ps[i].Name })
fmt.Println(ps)

// Output: [{bob 12} {alice 21}]

Original answer follows:


We can implement custom sorting that uses a custom alphabet. We just need to create the appropriate less(i, j int) bool function, and the sort package will do the rest.

Question is how to create such a less() function?

Let's start by defining the custom alphabet. Convenient way is to create a string that contains the letters of the custom alphabet, enumerated (ordered) from smallest to highest. For example:

const alphabet = "bca"

Let's create a map from this alphabet, which will tell the weight or order of each letter of our custom alphabet:

var weights = map[rune]int{}

func init() {
    for i, r := range alphabet {
        weights[r] = i
    }
}

(Note: i in the above loop is the byte index, not the rune index, but since both are monotone increasing, both will do just fine for rune weight.)

Now we can create our less() function. To have "acceptable" performance, we should avoid converting the input string values to byte or rune slices. To do that, we can call aid from the utf8.DecodeRuneInString() function which decodes the first rune of a string.

So we do the comparison rune-by-rune. If both runes are letters of the custom alphabet, we may use their weights to tell how they compare to each other. If at least one of the runes are not from our custom alphabet, we will fallback to simple numeric rune comparisons.

If 2 runes at the beginning of the 2 input strings are equal, we proceed to the next runes in each input string. We may do this my slicing the input strings: slicing them does not make a copy, it just returns a new string header that points to the data of the original strings.

All right, now let's see the implementation of this less() function:

func less(s1, s2 string) bool {
    for {
        switch e1, e2 := len(s1) == 0, len(s2) == 0; {
        case e1 && e2:
            return false // Both empty, they are equal (not less)
        case !e1 && e2:
            return false // s1 not empty but s2 is: s1 is greater (not less)
        case e1 && !e2:
            return true // s1 empty but s2 is not: s1 is less
        }

        r1, size1 := utf8.DecodeRuneInString(s1)
        r2, size2 := utf8.DecodeRuneInString(s2)

        // Check if both are custom, in which case we use custom order:
        custom := false
        if w1, ok1 := weights[r1]; ok1 {
            if w2, ok2 := weights[r2]; ok2 {
                custom = true
                if w1 != w2 {
                    return w1 < w2
                }
            }
        }
        if !custom {
            // Fallback to numeric rune comparison:
            if r1 != r2 {
                return r1 < r2
            }
        }

        s1, s2 = s1[size1:], s2[size2:]
    }
}

Let's see some trivial tests of this less() function:

pairs := [][2]string{
    {"b", "c"},
    {"c", "a"},
    {"b", "a"},
    {"a", "b"},
    {"bca", "bac"},
}
for _, pair := range pairs {
    fmt.Printf("\"%s\" < \"%s\" ? %t\n", pair[0], pair[1], less(pair[0], pair[1]))
}

Output (try it on the Go Playground):

"b" < "c" ? true
"c" < "a" ? true
"b" < "a" ? true
"a" < "b" ? false
"bca" < "bac" ? true

And now let's test this less() function in an actual sorting:

ss := []string{
    "abc",
    "abca",
    "abcb",
    "abcc",
    "bca",
    "cba",
    "bac",
}
sort.Slice(ss, func(i int, j int) bool {
    return less(ss[i], ss[j])
})
fmt.Println(ss)

Output (try it on the Go Playground):

[bca bac cba abc abcb abcc abca]

Again, if performance is important to you, you should not use sort.Slice() as that has to use reflection under the hood, but rather create your own slice type that implements sort.Interface, and in your implementation you can tell how to do it without using reflection.

This is how it could look like:

type CustStrSlice []string

func (c CustStrSlice) Len() int           { return len(c) }
func (c CustStrSlice) Less(i, j int) bool { return less(c[i], c[j]) }
func (c CustStrSlice) Swap(i, j int)      { c[i], c[j] = c[j], c[i] }

When you want to sort a string slice using the custom alphabet, simply convert your slice to CustStrSlice, so it can be passed directly to sort.Sort() (this type conversion does not make a copy of the slice or its elements, it just changes the type information):

ss := []string{
    "abc",
    "abca",
    "abcb",
    "abcc",
    "bca",
    "cba",
    "bac",
}
sort.Sort(CustStrSlice(ss))
fmt.Println(ss)

Output of the above is again (try it on the Go Playground):

[bca bac cba abc abcb abcc abca]

Some things to note:

The default string comparison compares strings byte-wise. That is, if the input strings contain invalid UTF-8 sequences, the actual bytes will still be used.

Our solution is different in this regard, as we decode runes (we have to because we use a custom alphabet in which we allow runes that are not necessarily mapped to bytes 1-to-1 in UTF-8 encoding). This means if the input is not a valid UTF-8 sequence, the behavior might not be consistent with the default ordering. But if your inputs are valid UTF-8 sequences, this will do what you expect it to do.

One last note:

We've seen how a string slice could be custom-sorted. If we have a slice of structs (or a slice of pointers of structs), the sorting algorithm (the less() function) may be the same, but when comparing elements of the slice, we have to compare fields of the elements, not the struct elements themselves.

So let's say we have the following struct:

type Person struct {
    Name string
    Age  int
}

func (p *Person) String() string { return fmt.Sprint(*p) }

(The String() method is added so we'll see the actual contents of the structs, not just their addresses...)

And let's say we want to apply our custom sorting on a slice of type []*Person, using the Name field of the Person elements. So we simply define this custom type:

type PersonSlice []*Person

func (p PersonSlice) Len() int           { return len(p) }
func (p PersonSlice) Less(i, j int) bool { return less(p[i].Name, p[j].Name) }
func (p PersonSlice) Swap(i, j int)      { p[i], p[j] = p[j], p[i] }

And that's all. The rest is the same, for example:

ps := []*Person{
    {Name: "abc"},
    {Name: "abca"},
    {Name: "abcb"},
    {Name: "abcc"},
    {Name: "bca"},
    {Name: "cba"},
    {Name: "bac"},
}
sort.Sort(PersonSlice(ps))
fmt.Println(ps)

Output (try it on the Go Playground):

[{bca 0} {bac 0} {cba 0} {abc 0} {abcb 0} {abcc 0} {abca 0}]
like image 177
icza Avatar answered Sep 29 '22 07:09

icza