I want to find out a rune's Unicode properties, particularly the value of its script property. Unicode has this to say (in http://www.unicode.org/reports/tr24/ Section 1.5):
The script property assigns a single value to each character, either
explicitly associating it with a particular script, or assigning one
of several specail [sic] values.
Go's unicode
package provides me with a way to ask, "Is this rune in script x?", but has no way for me to ask, "In what script is this rune?". I could obviously iterate over all scripts, but that would be wasteful. Is there a cleverer way to find out a rune's script? (I could always implement a self-organising list, but I'm looking for something in the standard go libraries that already does what I want, and that I have overlooked.)
Thanks all!
The easiest and quickest solution is to write the function. For example,
package main
import (
"fmt"
"unicode"
)
var runeScript map[rune]string
func init() {
const nChar = 128172 // Version 9.0.0
runeScript = make(map[rune]string, nChar*125/100)
for s, rt := range unicode.Scripts {
for _, r := range rt.R16 {
for i := r.Lo; i <= r.Hi; i += r.Stride {
runeScript[rune(i)] = s
}
}
for _, r := range rt.R32 {
for i := r.Lo; i <= r.Hi; i += r.Stride {
runeScript[rune(i)] = s
}
}
}
}
func script(r rune) string {
return runeScript[r]
}
func main() {
chars := []rune{' ', '0', 'a', 'α', 'А', 'ㄱ'}
for _, c := range chars {
s := script(c)
fmt.Printf("%q %s\n", c, s)
}
}
Output:
$ go run script.go
' ' Common
'0' Common
'a' Latin
'α' Greek
'А' Cyrillic
'ㄱ' Hangul
$
PeterSO's answer is nice and clear. It doesn't go easy on memory usage though, as it stores more than a hundred thousand entries in a map, values being of string
type. Even though a string
value is just a header storing a pointer and a length (see reflect.StringHeader
), having so many of them in a map is still multiple MB (like 6 MB)!
Since the number of possible different string
values (the different script names) is small (137), we may opt to use a value type byte
, which will just be an index in a slice storing the real script names.
This is how it could look like:
var runeScript map[rune]byte
var names = []string{""}
func init() {
const nChar = 128172 // Version 9.0.0
runeScript = make(map[rune]byte, nChar*125/100)
for s, rt := range unicode.Scripts {
idx := byte(len(names))
names = append(names, s)
for _, r := range rt.R16 {
for i := r.Lo; i <= r.Hi; i += r.Stride {
runeScript[rune(i)] = idx
}
}
for _, r := range rt.R32 {
for i := r.Lo; i <= r.Hi; i += r.Stride {
runeScript[rune(i)] = idx
}
}
}
}
func script(r rune) string {
return names[runeScript[r]]
}
func main() {
chars := []rune{' ', '0', 'a', 'α', 'А', 'ㄱ'}
for _, c := range chars {
s := script(c)
fmt.Printf("%q %s\n", c, s)
}
}
This simple improvement requires only one third of the memory compared to using map[rune]string
. Output is the same (try it on the Go Playground):
' ' Common
'0' Common
'a' Latin
'α' Greek
'А' Cyrillic
'ㄱ' Hangul
Using map[rune]byte
will result in like 2 MB of RAM usage, and it takes "some" time to build this map, which may or may not be acceptable.
There's another approach / solution. We may opt in to not build a map of "all" runes, but only store a slice of all ranges (actually 2 slices of ranges, one with 16-bit Unicode values, and another with 32-bit Unicode codepoints).
The benefit of this originates from the fact that the number of ranges is much less than the number of runes: only 852 (compared to 100,000+ runes). Memory usage of 2 slices having a total of 852 elements will be negligible compared to solution #1.
In our ranges we also store the script (name), so we can return this info. We could also store only a name index (as in solution #1), but since we only have 852 ranges, it's not worth it.
We'll sort the range slices, so we can use binary search in it (~400 elements in a slice, binary search: we get the result in like 7 steps max, and worst case repeating binary search on both: 15 steps).
Ok, so let's see. We're using these range wrappers:
type myR16 struct {
r16 unicode.Range16
script string
}
type myR32 struct {
r32 unicode.Range32
script string
}
And store them in:
var allR16 = []*myR16{}
var allR32 = []*myR32{}
We initialize / fill them like this:
func init() {
for script, rt := range unicode.Scripts {
for _, r16 := range rt.R16 {
allR16 = append(allR16, &myR16{r16, script})
}
for _, r32 := range rt.R32 {
allR32 = append(allR32, &myR32{r32, script})
}
}
// sort
sort.Slice(allR16, func(i int, j int) bool {
return allR16[i].r16.Lo < allR16[j].r16.Lo
})
sort.Slice(allR32, func(i int, j int) bool {
return allR32[i].r32.Lo < allR32[j].r32.Lo
})
}
And finally the search in the sorted range slices:
func script(r rune) string {
// binary search over ranges
if r <= 0xffff {
r16 := uint16(r)
i := sort.Search(len(allR16), func(i int) bool {
return allR16[i].r16.Hi >= r16
})
if i < len(allR16) && allR16[i].r16.Lo <= r16 && r16 <= allR16[i].r16.Hi {
return allR16[i].script
}
}
r32 := uint32(r)
i := sort.Search(len(allR32), func(i int) bool {
return allR32[i].r32.Hi >= r32
})
if i < len(allR32) && allR32[i].r32.Lo <= r32 && r32 <= allR32[i].r32.Hi {
return allR32[i].script
}
return ""
}
Note: the Stride
is always 1 in all scripts in the unicode
package, which I took advantage of (and did not include it in the algorithm).
Testing with the same code, we get the same output. Try it on the Go Playground.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With