
How to get a single Unicode character from string

Tags: string, unicode, go

How can I get a single Unicode character from a string? For example, if the string is "你好", how can I get the first character "你"?

I found one way elsewhere:

    var str = "你好"
    runes := []rune(str)
    fmt.Println(string(runes[0]))

It does work. But I still have some questions:

  1. Is there another way to do it?

  2. Why does str[0] in Go return a byte rather than a Unicode character?

asked May 15 '15 15:05 by 赵浩翔



1 Answer

First, you may want to read https://blog.golang.org/strings, which answers part of your question.

A string in Go can contain arbitrary bytes. When you write str[i], the result is a byte, and the index always counts bytes, not characters.
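To make the byte indexing concrete, here is a small sketch using the string from the question. The helper name firstByteAndLen is hypothetical and exists only for illustration; it assumes a non-empty string.

```go
package main

import "fmt"

// firstByteAndLen is a hypothetical helper for illustration only.
// It returns the first byte of s and the length of s in bytes.
// It assumes s is non-empty; s[0] panics on an empty string.
func firstByteAndLen(s string) (byte, int) {
	return s[0], len(s)
}

func main() {
	str := "你好"
	b, n := firstByteAndLen(str)

	// len counts bytes: each of these two characters takes 3 bytes in UTF-8.
	fmt.Println(n) // 6

	// str[0] is only the first byte of the UTF-8 encoding of "你".
	fmt.Printf("0x%x\n", b) // 0xe4

	// Slicing the first 3 bytes yields the complete first character.
	fmt.Println(str[:3]) // 你
}
```

Slicing by a fixed byte count only works here because the width of "你" is known in advance; the approaches below avoid that assumption.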

Most of the time, strings are encoded in UTF-8 though. You have multiple ways to deal with UTF-8 encoding in a string.

For instance, you can use the for...range statement to iterate over a string rune by rune.

    var first rune
    for _, c := range str {
        first = c
        break
    }
    // first now contains the first rune of the string

You can also leverage the unicode/utf8 package. For instance:

    r, size := utf8.DecodeRuneInString(str)
    // r contains the first rune of the string
    // size is the size of the rune in bytes

If the string is encoded in UTF-8, there is no direct way to access the nth rune of the string, because the size of the runes (in bytes) is not constant. If you need this feature, you can easily write your own helper function to do it (with for...range, or with the unicode/utf8 package).
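As a minimal sketch of such a helper, built on for...range: the name nthRune and its (rune, bool) return shape are assumptions made for illustration, not something from the standard library.

```go
package main

import "fmt"

// nthRune is a hypothetical helper: it returns the n-th rune of s
// (0-based) and true, or 0 and false when s has fewer than n+1 runes
// or n is negative. It walks the string from the start, so the cost
// grows with n.
func nthRune(s string, n int) (rune, bool) {
	if n < 0 {
		return 0, false
	}
	i := 0
	// range decodes s as UTF-8 and yields one rune per iteration.
	for _, r := range s {
		if i == n {
			return r, true
		}
		i++
	}
	return 0, false
}

func main() {
	r, ok := nthRune("你好", 1)
	fmt.Println(string(r), ok) // 好 true

	_, ok = nthRune("你好", 2)
	fmt.Println(ok) // false
}
```

If you need many random accesses into the same string, converting it once with []rune(str) and indexing the slice is likely simpler, at the cost of allocating a copy.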

answered Oct 11 '22 08:10 by Didier Spezia