 

Counting Unicode characters in JavaScript [duplicate]

I ran into an issue with counting Unicode characters. I need to count the total number of combined Unicode characters.

Take this character for example:

द्ध

If you use the .length property on this string, it gives you 3, which is technically correct, as it is a combination of द, ् (virama) and ध.

However, put द्ध in a text area and you realize, by using the arrow keys, that it is treated as one character. Only when you use backspace do you see that there are 3 characters.

Edit: Also, for your test case, please consider that it could be a word. It could be something like:

द्धद्द

This should give 2, but .length gives 6.
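To make the numbers above concrete, here is a small sketch that prints what .length counts for the two example strings. The helper name codePoints is made up for illustration; the strings are written as escapes so the individual code points are visible.

```javascript
// The two examples from the question, spelled out as escapes.
const single = "\u0926\u094D\u0927"; // द्ध
const word = single + "\u0926\u094D\u0926"; // द्धद्द

// Hypothetical helper: list each code point of a string as U+XXXX.
function codePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

console.log(single.length); // 3
console.log(codePoints(single)); // [ 'U+0926', 'U+094D', 'U+0927' ]
console.log(word.length); // 6
```

So .length reports UTF-16 code units (here one per code point), not what the eye sees as one character.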

This is a problem when you want to get or set the current caret position in input elements.

pewpewlasers asked Aug 13 '14 17:08





1 Answer

Your example “द्ध” is a string of three Unicode characters, and the length property correctly indicates this.

What you apparently want to count is “characters” in some other sense, something like “what a speaker of a language intuitively sees as one character”. This is a vague and mutable concept. Unicode Standard Annex #29 (UAX #29), Unicode Text Segmentation, tries to analyze the concept, calling it a “grapheme cluster”, and describes some algorithms for working with it.
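In runtimes that implement the Intl.Segmenter API, UAX #29 grapheme segmentation can be tried out directly. The following is only a sketch: countGraphemes is a hypothetical helper name, and the result for Devanagari conjuncts depends on which Unicode version the engine's segmentation data follows, so no single number is promised for द्ध.

```javascript
// Hypothetical helper: count grapheme clusters as the runtime's
// Intl.Segmenter sees them. Requires a runtime that ships Intl.Segmenter.
function countGraphemes(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}

// "e" followed by a combining acute accent: two code units, one cluster.
console.log(countGraphemes("e\u0301")); // 1

// द्ध: the virama always attaches to the first consonant; whether the second
// consonant joins the same cluster depends on the engine's Unicode data.
console.log(countGraphemes("\u0926\u094D\u0927", "hi")); // 1 or 2
```

That version dependence is a practical illustration of why the concept is called mutable above.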

Unfortunately, JavaScript as of ECMAScript 5 has no built-in tools for recognizing whether a character is, e.g., a combining mark and thus should be regarded as part of a cluster. However, if you can limit yourself to handling just one writing system, you can probably code the operations manually, referring to possible Unicode characters by their code numbers.
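As a rough illustration of the manual, single-script approach, here is a minimal sketch for Devanagari only. Everything in it is an assumption made for the example: the function names are hypothetical, the code-point ranges are a simplified reading of the Devanagari block (U+0900 to U+097F), and ZWJ/ZWNJ and other scripts are ignored. The rule used is: a combining sign extends the current cluster, and a consonant that directly follows a virama (U+094D) joins the cluster as part of a conjunct.

```javascript
const VIRAMA = 0x094d;

// Assumed ranges for Devanagari combining signs (vowel signs, nukta,
// virama, candrabindu, anusvara, visarga, stress marks).
function isDevanagariMark(cp) {
  return (
    (cp >= 0x0900 && cp <= 0x0903) ||
    (cp >= 0x093a && cp <= 0x093c) ||
    (cp >= 0x093e && cp <= 0x094f) ||
    (cp >= 0x0951 && cp <= 0x0957) ||
    (cp >= 0x0962 && cp <= 0x0963)
  );
}

// Assumed ranges for Devanagari consonants (basic and nukta forms).
function isDevanagariConsonant(cp) {
  return (cp >= 0x0915 && cp <= 0x0939) || (cp >= 0x0958 && cp <= 0x095f);
}

// Hypothetical helper: count user-perceived characters in Devanagari text.
function countDevanagariClusters(text) {
  let count = 0;
  let previous = -1;
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    const joinsPrevious =
      count > 0 &&
      (isDevanagariMark(cp) ||
        (previous === VIRAMA && isDevanagariConsonant(cp)));
    if (!joinsPrevious) count += 1;
    previous = cp;
  }
  return count;
}

console.log(countDevanagariClusters("\u0926\u094D\u0927")); // 1 (द्ध)
console.log(countDevanagariClusters("\u0926\u094D\u0927\u0926\u094D\u0926")); // 2 (द्धद्द)
```

Whether these rules match what a given reader or editor treats as one character is exactly the open question; treat the ranges as a starting point to check against the Unicode charts.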

Moreover, if the intent is to make the count match the way some input editor works (e.g. how the arrow keys move over characters), you would need to know the logic of that editor. It may implement Unicode grapheme clusters in some sense, or something else.
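For the caret use case, one possible sketch is to translate a caret offset in UTF-16 code units (the unit used by an input element's selectionStart) into a cluster index. This assumes the editor moves the caret by the same grapheme clusters that the runtime's Intl.Segmenter produces, which, as noted, may not hold for a particular editor. The function name caretToClusterIndex is hypothetical.

```javascript
// Hypothetical helper: how many whole grapheme clusters lie before the
// given caret offset (measured in UTF-16 code units)?
function caretToClusterIndex(text, caretOffset) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let clusters = 0;
  for (const { index, segment } of segmenter.segment(text)) {
    // Stop at the first cluster that ends after the caret.
    if (index + segment.length > caretOffset) break;
    clusters += 1;
  }
  return clusters;
}

// "a", then "e" + combining acute accent, then "b": 4 code units, 3 clusters.
const sample = "ae\u0301b";
console.log(caretToClusterIndex(sample, 1)); // 1
console.log(caretToClusterIndex(sample, 3)); // 2
console.log(caretToClusterIndex(sample, 4)); // 3
```

A caret offset that falls inside a cluster (offset 2 in the sample) is reported as the index of the cluster it sits in, which is one reasonable convention among several.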

Jukka K. Korpela answered Oct 18 '22 05:10
