Golang regexp with non-latin characters

Tags: regex, go

I am parsing words from some sentences and my \w+ regexp works fine with Latin characters. However, it totally fails with some Cyrillic characters.

Here is a sample app:

package main

import (
    "fmt"
    "regexp"
)

func get_words_from(text string) []string {
    words := regexp.MustCompile("\\w+")
    return words.FindAllString(text, -1)
}

func main() {
    text := "One, two three!"
    text2 := "Раз, два три!"
    text3 := "Jedna, dva tři čtyři pět!"
    fmt.Println(get_words_from(text))
    fmt.Println(get_words_from(text2))
    fmt.Println(get_words_from(text3))
}

It yields the following results:

 [One two three]
 []
 [Jedna dva t i ty i p t]

It returns an empty result for Russian and splits the Czech words into fragments at every accented letter. I have no idea how to solve this issue. Could someone give me some advice?

Or maybe there is a better way to split a sentence into words without punctuation?

Keir, asked May 27 '15 12:05



1 Answer

In Go's regexp package, the \w shorthand class matches only ASCII letters, digits, and the underscore. To match letters from any script, you need the Unicode character class \p{L}.

\w word characters (== [0-9A-Za-z_])

To keep matching digits and the underscore as \w does, combine them with \p{L} in a character class (note that \d also matches only ASCII digits in Go):

    regexp.MustCompile("[\\p{L}\\d_]+")

Output of the demo:

[One two three]
[Раз два три]
[Jedna dva tři čtyři pět]
Wiktor Stribiżew, answered Sep 21 '22 09:09
