
Getting substring without messing up UTF-8 string

I have a UTF-8 encoded string that comes from an Ajax response, and I want to get the substring of it up to the first comma. For the string "Привет, мир", the result should be "Привет".

Will this work and not run into "multibyte-ness" issues?

var i = text.indexOf(',');
// slice is the non-deprecated equivalent of substr for this use
if (i !== -1) text = text.slice(0, i);

Or is it better to use split?

galymzhan asked May 24 '13 15:05




1 Answer

JavaScript indexes strings by character, not by byte.
So yes, that's fine from an encoding/string-handling standpoint: by the time the response reaches your code as a string, the UTF-8 bytes have already been decoded.
You can treat strings in JavaScript as having no particular encoding, just as a sequence of characters.

> "漢字".substr(1)
  "字"
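Applied to the string from the question, a minimal sketch might look like this. The helper name `beforeFirstComma` is made up for illustration and is not part of the original code; `split` is shown only to confirm it gives the same result.

```javascript
// Hypothetical helper (name invented for this example): returns the text
// before the first comma, or the whole string if there is no comma.
function beforeFirstComma(text) {
  const i = text.indexOf(',');
  return i === -1 ? text : text.slice(0, i);
}

console.log(beforeFirstComma('Привет, мир'));   // Привет
console.log(beforeFirstComma('no comma here')); // no comma here

// split gives the same first field, at the cost of building an array
console.log('Привет, мир'.split(',')[0]);       // Привет
```

Either approach should do; `indexOf` plus `slice` simply avoids allocating the array of remaining fields.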

Note that this "strings are characters" view is only a simplification, though. As pointed out in the comments, JavaScript strings are sequences of 16-bit UTF-16 code units, not code points. This lets you treat strings "by character" for the vast majority of common characters, but the abstraction breaks down for characters that need more than one code unit (more than 2 bytes) in UTF-16, and for characters composed of more than one code point.
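To make the caveat concrete, here is a small sketch of where the abstraction leaks. The sample strings are my own, not from the question. Cutting at a comma found with `indexOf` stays safe, because a comma is a single code unit and is never part of a surrogate pair; cutting at an arbitrary numeric index is what can split a character.

```javascript
// U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 stores it
// as a surrogate pair (two 16-bit code units).
const s = '\u{1F600},b';

console.log(s.length);                           // 4 (2 + 1 + 1 code units)
console.log(s.indexOf(','));                     // 2
console.log(s.slice(0, s.indexOf(',')) === '\u{1F600}'); // true: the pair stays intact

// Cutting at an arbitrary index can split the pair and leave a lone surrogate.
console.log(s.slice(0, 1) === '\uD83D');         // true

// Array.from iterates by code point rather than by code unit.
console.log(Array.from(s).length);               // 3

// Characters built from several code points are still not one unit:
// "e" followed by U+0301 (combining acute accent) renders as one glyph.
const e = 'e\u0301';
console.log(e.length);                           // 2
console.log(Array.from(e).length);               // 2
```

If slicing at arbitrary positions is needed, iterating by code point (as `Array.from` does here) handles surrogate pairs, but combining sequences would still need grapheme-aware handling.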

deceze answered Sep 29 '22 10:09