Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Go: How to find out a rune's Unicode properties?

Tags:

unicode

go

I want to find out a rune's Unicode properties, particularly the value of its script property. Unicode has this to say (in http://www.unicode.org/reports/tr24/ Section 1.5):

The script property assigns a single value to each character, either
explicitly associating it with a particular script, or assigning one
of several specail [sic] values.

Go's unicode package provides me with a way to ask, "Is this rune in script x?", but has no way for me to ask, "In what script is this rune?". I could obviously iterate over all scripts, but that would be wasteful. Is there a cleverer way to find out a rune's script? (I could always implement a self-organising list, but I'm looking for something in the standard go libraries that already does what I want, and that I have overlooked.)

Thanks all!

like image 458
Stephan Neuhaus Avatar asked Feb 05 '23 19:02

Stephan Neuhaus


2 Answers

The easiest and quickest solution is to write the function. For example,

package main

import (
    "fmt"
    "unicode"
)

var runeScript map[rune]string

func init() {
    const nChar = 128172 // Version 9.0.0
    runeScript = make(map[rune]string, nChar*125/100)
    for s, rt := range unicode.Scripts {
        for _, r := range rt.R16 {
            for i := r.Lo; i <= r.Hi; i += r.Stride {
                runeScript[rune(i)] = s
            }
        }
        for _, r := range rt.R32 {
            for i := r.Lo; i <= r.Hi; i += r.Stride {
                runeScript[rune(i)] = s
            }
        }
    }
}

func script(r rune) string {
    return runeScript[r]
}

func main() {
    chars := []rune{' ', '0', 'a', 'α', 'А', 'ㄱ'}
    for _, c := range chars {
        s := script(c)
        fmt.Printf("%q %s\n", c, s)
    }
}

Output:

$ go run script.go
' ' Common
'0' Common
'a' Latin
'α' Greek
'А' Cyrillic
'ㄱ' Hangul
$ 
like image 85
peterSO Avatar answered Feb 16 '23 15:02

peterSO


Improving PeterSO's answer

PeterSO's answer is nice and clear. It doesn't go easy on memory usage though, as it stores more than a hundred thousand entries in a map, values being of string type. Even though a string value is just a header storing a pointer and a length (see reflect.StringHeader), having so many of them in a map is still multiple MB (like 6 MB)!

Since the number of possible different string values (the different script names) is small (137), we may opt to use a value type byte, which will just be an index in a slice storing the real script names.

This is how it could look like:

var runeScript map[rune]byte

var names = []string{""}

func init() {
    const nChar = 128172 // Version 9.0.0
    runeScript = make(map[rune]byte, nChar*125/100)
    for s, rt := range unicode.Scripts {
        idx := byte(len(names))
        names = append(names, s)
        for _, r := range rt.R16 {
            for i := r.Lo; i <= r.Hi; i += r.Stride {
                runeScript[rune(i)] = idx
            }
        }
        for _, r := range rt.R32 {
            for i := r.Lo; i <= r.Hi; i += r.Stride {
                runeScript[rune(i)] = idx
            }
        }
    }
}

func script(r rune) string {
    return names[runeScript[r]]
}

func main() {
    chars := []rune{' ', '0', 'a', 'α', 'А', 'ㄱ'}
    for _, c := range chars {
        s := script(c)
        fmt.Printf("%q %s\n", c, s)
    }
}

This simple improvement requires only one third of the memory compared to using map[rune]string. Output is the same (try it on the Go Playground):

' ' Common
'0' Common
'a' Latin
'α' Greek
'А' Cyrillic
'ㄱ' Hangul

Building merged range slices

Using map[rune]byte will result in like 2 MB of RAM usage, and it takes "some" time to build this map, which may or may not be acceptable.

There's another approach / solution. We may opt in to not build a map of "all" runes, but only store a slice of all ranges (actually 2 slices of ranges, one with 16-bit Unicode values, and another with 32-bit Unicode codepoints).

The benefit of this originates from the fact that the number of ranges is much less than the number of runes: only 852 (compared to 100,000+ runes). Memory usage of 2 slices having a total of 852 elements will be negligible compared to solution #1.

In our ranges we also store the script (name), so we can return this info. We could also store only a name index (as in solution #1), but since we only have 852 ranges, it's not worth it.

We'll sort the range slices, so we can use binary search in it (~400 elements in a slice, binary search: we get the result in like 7 steps max, and worst case repeating binary search on both: 15 steps).

Ok, so let's see. We're using these range wrappers:

type myR16 struct {
    r16    unicode.Range16
    script string
}

type myR32 struct {
    r32    unicode.Range32
    script string
}

And store them in:

var allR16 = []*myR16{}
var allR32 = []*myR32{}

We initialize / fill them like this:

func init() {
    for script, rt := range unicode.Scripts {
        for _, r16 := range rt.R16 {
            allR16 = append(allR16, &myR16{r16, script})
        }
        for _, r32 := range rt.R32 {
            allR32 = append(allR32, &myR32{r32, script})
        }
    }

    // sort
    sort.Slice(allR16, func(i int, j int) bool {
        return allR16[i].r16.Lo < allR16[j].r16.Lo
    })
    sort.Slice(allR32, func(i int, j int) bool {
        return allR32[i].r32.Lo < allR32[j].r32.Lo
    })
}

And finally the search in the sorted range slices:

func script(r rune) string {
    // binary search over ranges
    if r <= 0xffff {
        r16 := uint16(r)
        i := sort.Search(len(allR16), func(i int) bool {
            return allR16[i].r16.Hi >= r16
        })

        if i < len(allR16) && allR16[i].r16.Lo <= r16 && r16 <= allR16[i].r16.Hi {
            return allR16[i].script
        }
    }

    r32 := uint32(r)
    i := sort.Search(len(allR32), func(i int) bool {
        return allR32[i].r32.Hi >= r32
    })

    if i < len(allR32) && allR32[i].r32.Lo <= r32 && r32 <= allR32[i].r32.Hi {
        return allR32[i].script
    }

    return ""
}

Note: the Stride is always 1 in all scripts in the unicode package, which I took advantage of (and did not include it in the algorithm).

Testing with the same code, we get the same output. Try it on the Go Playground.

like image 26
icza Avatar answered Feb 16 '23 14:02

icza