I am parsing words from some sentences, and my \w+ regexp works fine with Latin characters. However, it totally fails with some Cyrillic characters.
Here is a sample app:
package main

import (
	"fmt"
	"regexp"
)

func get_words_from(text string) []string {
	words := regexp.MustCompile("\\w+")
	return words.FindAllString(text, -1)
}

func main() {
	text := "One, two three!"
	text2 := "Раз, два три!"
	text3 := "Jedna, dva tři čtyři pět!"
	fmt.Println(get_words_from(text))
	fmt.Println(get_words_from(text2))
	fmt.Println(get_words_from(text3))
}
It yields the following results:
[One two three]
[]
[Jedna dva t i ty i p t]
It returns an empty result for Russian and splits the Czech words into fragments. I have no idea how to solve this issue. Could someone give me a piece of advice?
Or maybe there is a better way to split a sentence into words without punctuation?
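The \w shorthand class only matches ASCII letters, digits, and the underscore in Go regex, so you need the Unicode character class \p{L} to match letters from any script. The Go regexp syntax reference defines \w as:
\w    word characters (== [0-9A-Za-z_])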
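As a quick check of that difference, the sketch below tests a few single characters against both classes. The variable names asciiWord and unicodeLetter are made up for this illustration and do not come from the question.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical names, used only for this illustration.
var (
	// \w in Go is ASCII-only: [0-9A-Za-z_]
	asciiWord = regexp.MustCompile(`^\w$`)
	// \p{L} matches a single letter from any script.
	unicodeLetter = regexp.MustCompile(`^\p{L}$`)
)

func main() {
	// Each sample is a single code point.
	for _, s := range []string{"t", "ř", "д", "7", "_"} {
		fmt.Printf("%q: \\w=%v \\p{L}=%v\n",
			s, asciiWord.MatchString(s), unicodeLetter.MatchString(s))
	}
}
```

The letters ř and д match only \p{L}, while the digit and the underscore match only \w, which is why the two need to be combined in one character class.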
Use a character class to include the digits and underscore:
regexp.MustCompile("[\\p{L}\\d_]+")
Output of the demo:
[One two three]
[Раз два три]
[Jedna dva tři čtyři pět]