I have a small problem.
I'm using Node.js as the backend. A user has a "biography" field, where they can write something about themselves.
Suppose this field has a maxlength of 220, and suppose the following input:
๐ถ๐ป๐ฆ๐ป๐ง๐ป๐จ๐ป๐ฉ๐ป๐ฑ๐ปโโ๏ธ๐ฑ๐ป๐ด๐ป๐ต๐ป๐ฒ๐ป๐ณ๐ปโโ๏ธ๐ณ๐ป๐ฎ๐ปโโ๏ธ๐ฎ๐ป๐ท๐ปโโ๏ธ๐ท๐ป๐๐ปโโ๏ธ๐๐ป๐ต๐ปโโ๏ธ๐ฉ๐ปโโ๏ธ๐จ๐ปโโ๏ธ๐ฉ๐ปโ๐พ๐จ๐ปโ๐พ๐จ๐ปโ๐พ๐จ๐ปโ๐พ๐จ๐ปโ๐พ๐จ๐ปโ๐พ๐จ๐ปโ๐พ๐จ๐ปโ๐พ๐จ๐ปโ๐พ๐จ๐ปโ๐พ๐จ๐ปโ๐พ๐จ๐ปโ๐พ๐จ๐ปโ๐พ๐จ๐ปโ๐พ๐จ๐ปโ๐พ๐จ๐ปโ๐พ
As you can see, there aren't 220 emojis (there are only 37), but if I run this on my Node.js server:
console.log(bio.length)
where bio is the input text, I get 221. How can I "parse" the input string to get the correct length? Is this a Unicode problem?
SOLVED
I used this library: https://github.com/orling/grapheme-splitter
I tried this:
var Grapheme = require('grapheme-splitter');
var splitter = new Grapheme();
console.log(splitter.splitGraphemes(bio).length);
and the length is 37. It works very well!
Note that JavaScript has no strlen() function (that name comes from C and PHP). In JavaScript, the length of a string is read from its length property.
length is a property, not a function, and it is available on both arrays and strings. For a string it returns the number of UTF-16 code units, which is not always the number of visible characters.
> Most of the emoji are 3-byte Unicode characters. The most recent Emoji standard has 1,182 characters classified as Emoji and 179 of them are in the BMP [1]. Others are encoded as 4 bytes in any UTF encodings.
str.length
gives the count of UTF-16 code units, so a character outside the BMP (stored as a surrogate pair) counts as 2.
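A minimal sketch of what that means in practice. The variable names are only illustrative, and the strings are written with escapes so the code points are unambiguous:

```javascript
// Strings are written with escapes so the code points are unambiguous.
const ascii = 'abc';       // three BMP characters
const heart = '\u2764';    // U+2764, inside the BMP
const baby = '\u{1F476}';  // U+1F476, outside the BMP (stored as a surrogate pair)

console.log(ascii.length); // 3
console.log(heart.length); // 1
console.log(baby.length);  // 2

// The UTF-8 byte length is yet another measure.
console.log(Buffer.byteLength(heart, 'utf8')); // 3
console.log(Buffer.byteLength(baby, 'utf8'));  // 4
```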
A Unicode-aware way to get the string length in code points is [...str].length
because the string iterator splits the string into code points rather than UTF-16 code units. Note that a code point is still not always one visible character.
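As a rough illustration, the helper below (countCodePoints is a hypothetical name, not a built-in) counts code points, and shows that an emoji with a skin tone modifier is still counted as 2:

```javascript
// Hypothetical helper: counts Unicode code points instead of UTF-16 code units.
function countCodePoints(str) {
  return [...str].length; // the string iterator yields one code point at a time
}

// Baby (U+1F476) followed by the light skin tone modifier (U+1F3FB).
// It renders as a single emoji but consists of two code points.
const babyLight = '\u{1F476}\u{1F3FB}';

console.log(babyLight.length);            // 4 (UTF-16 code units)
console.log(countCodePoints(babyLight));  // 2 (code points)
console.log(Array.from(babyLight).length); // 2 (Array.from uses the same iterator)
```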
If we need the length in graphemes (grapheme clusters), we have these native ways:
a. Unicode property escapes in RegExp. See for example: Unicode-aware version of \w or Matching emoji.
b. Intl.Segmenter: coming soon, probably in ES2021. Can be tested behind a flag in recent V8 versions (the implementation was synced with the latest spec in V8 8.6). Unflagged (shipped) in V8 8.7.
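A minimal sketch of option (b), assuming a runtime where Intl.Segmenter is available (for example a recent Node.js release). The helper names countGraphemes and isWithinLimit are hypothetical, and the 220 limit is taken from the question:

```javascript
// Hypothetical helper: counts grapheme clusters (user-perceived characters).
// Assumes Intl.Segmenter exists in the runtime.
function countGraphemes(str, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let count = 0;
  for (const _segment of segmenter.segment(str)) {
    count++;
  }
  return count;
}

// Hypothetical validation helper for a field such as the biography.
function isWithinLimit(str, maxGraphemes = 220) {
  return countGraphemes(str) <= maxGraphemes;
}

// Man (U+1F468) + light skin tone (U+1F3FB) + ZWJ (U+200D) + sheaf of rice (U+1F33E):
// a single "farmer" emoji built from four code points.
const farmer = '\u{1F468}\u{1F3FB}\u200D\u{1F33E}';

console.log(farmer.length);          // 7 (UTF-16 code units)
console.log([...farmer].length);     // 4 (code points)
console.log(countGraphemes(farmer)); // 1 (grapheme cluster)
console.log(isWithinLimit(farmer.repeat(37))); // true
```

Counting graphemes on the server only helps if the client-side limit is measured the same way; otherwise the two checks can disagree.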
See also:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
What every JavaScript developer should know about Unicode
JavaScript has a Unicode problem
Unicode-aware regular expressions in ES2015
ES6 Strings (and Unicode, ❤) in Depth
JavaScript for impatient programmers. Unicode - a brief introduction