
How to count the correct length of a string with emojis in JavaScript?

I have a small problem.

I'm using Node.js as the backend. A user has a "biography" field, where they can write something about themselves.

Suppose this field has a maxlength of 220, and suppose this is the input:

👶🏻👦🏻👧🏻👨🏻👩🏻👱🏻‍♀️👱🏻👴🏻👵🏻👲🏻👳🏻‍♀️👳🏻👮🏻‍♀️👮🏻👷🏻‍♀️👷🏻💂🏻‍♀️💂🏻🕵🏻‍♀️👩🏻‍⚕️👨🏻‍⚕️👩🏻‍🌾👨🏻‍🌾👨🏻‍🌾👨🏻‍🌾👨🏻‍🌾👨🏻‍🌾👨🏻‍🌾👨🏻‍🌾👨🏻‍🌾👨🏻‍🌾👨🏻‍🌾👨🏻‍🌾👨🏻‍🌾👨🏻‍🌾👨🏻‍🌾👨🏻‍🌾  

As you can see, there aren't 220 emojis (there are 37), but if I run this on my Node.js server

console.log(bio.length)

where bio is the input text, I get 221. How can I "parse" the input string to get the correct length? Is this a Unicode problem?

SOLVED

I used this library: https://github.com/orling/grapheme-splitter

I tried this:

var Grapheme = require('grapheme-splitter');
var splitter = new Grapheme();
console.log(splitter.splitGraphemes(bio).length);

and the length is 37. It works very well!

asked Jan 25 '19 16:01 by Stackedo


1 Answer

  1. str.length gives the count of UTF-16 code units.

  2. A Unicode-aware way to get the string length in code points (Unicode characters) is [...str].length, because the string iterator yields code points rather than code units.

  3. If we need the length in graphemes (grapheme clusters), we have these native ways:

    a. Unicode property escapes in RegExp. See for example: Unicode-aware version of \w or Matching emoji.

    b. Intl.Segmenter, which is specified in the ECMAScript Internationalization API (ECMA-402) rather than in the core language. It was first available behind a flag (the implementation was synced with the latest spec in V8 8.6) and shipped unflagged in V8 8.7.
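The three measures in the list above can be compared on a single emoji. This is only a sketch: it assumes a runtime that ships Intl.Segmenter (for example Node.js 16 or later), and the names countGraphemes and isWithinLimit are hypothetical helpers introduced for illustration, not part of any library.

```javascript
// One visible emoji built from four code points:
// U+1F468 (man) + U+1F3FB (light skin tone) + U+200D (zero-width joiner) + U+1F33E (sheaf of rice)
const farmer = "\u{1F468}\u{1F3FB}\u200D\u{1F33E}";

// 1. UTF-16 code units: what String.prototype.length reports.
//    Each of the three astral code points needs 2 units, the joiner needs 1.
const codeUnits = farmer.length; // 7

// 2. Code points: the string iterator walks code points, not code units.
const codePoints = [...farmer].length; // 4

// 3. Grapheme clusters: what a user perceives as one character.
//    Hypothetical helper; assumes Intl.Segmenter is available.
function countGraphemes(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}
const graphemes = countGraphemes(farmer); // 1 on runtimes with current Unicode data

// Hypothetical limit check for a field such as the biography in the question.
function isWithinLimit(text, maxGraphemes) {
  return countGraphemes(text) <= maxGraphemes;
}

console.log({ codeUnits, codePoints, graphemes });
console.log(isWithinLimit(farmer.repeat(37), 220));
```

Counting graphemes for the limit matches what the user sees, but the stored string can still be several times longer in code units or bytes, so a database column sized for 220 may need a separate check.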

See also:

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

  • What every JavaScript developer should know about Unicode

  • JavaScript has a Unicode problem

  • Unicode-aware regular expressions in ES2015

  • ES6 Strings (and Unicode, ❤) in Depth

  • JavaScript for impatient programmers. Unicode – a brief introduction

answered Sep 20 '22 17:09 by vsemozhebuty