I ran into an issue with counting Unicode characters. I need to count the combined characters as a user sees them, not the individual code points.
Take this character for example: द्ध
If you use the .length property on this string, it gives you 3, which is technically correct, as it is a combination of द, ् and ध.
However, put द्ध in a text area and you realize by using the arrow keys that it is considered one character. Only when you use backspace do you realize that there are 3 characters.
Edit: Also, for your test case, please consider that it could be a word. It could be something like द्धद्द. This should give 2 with .length, but it gives 6.
This is a problem when you want to get or set the current caret position in input elements.
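Counting characters in JavaScript using the length property: the most basic way is to use the .length property, which is available on all strings. It returns the number of UTF-16 code units in the string, including whitespace and other non-visible characters. That equals the number of characters only when every character fits in a single code unit.
Popular encodings are UTF-8, UTF-16 and UTF-32. JavaScript exposes strings as sequences of UTF-16 code units, so UTF-16 is the encoding that matters here.
JavaScript uses Unicode for strings. Most characters are encoded with a single 16-bit code unit (2 bytes), but 16 bits can represent at most 65536 distinct values, so characters outside that range are stored as a pair of code units (a surrogate pair) and count as 2 in .length.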
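The 65536 limit is easy to observe directly. The following is a small sketch, not part of the original posts; the helper names codeUnits and codePoints are made up for illustration. It compares the UTF-16 length of a string with the number of code points the string iterator yields.

```javascript
// Hypothetical helpers, named here only for the comparison.
const codeUnits = (s) => s.length;               // counts UTF-16 code units
const codePoints = (s) => Array.from(s).length;  // the string iterator walks code points

const bmp = "A";            // U+0041 fits in one 16-bit code unit
const astral = "\u{1F600}"; // U+1F600 is above U+FFFF, so it needs a surrogate pair

console.log(codeUnits(bmp), codePoints(bmp));       // 1 1
console.log(codeUnits(astral), codePoints(astral)); // 2 1
```

Note that counting code points does not solve the question's problem: द्ध consists of three code points that all fit in one code unit each, so both counts are 3.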
Unicode is a superset of ASCII and Latin-1 and supports virtually every written language currently used on the planet. ECMAScript 3 requires JavaScript implementations to support Unicode version 2.1 or later, and ECMAScript 5 requires implementations to support Unicode 3 or later.
Your example “द्ध” is a string of three Unicode characters, and the length property correctly indicates this.
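To see the three characters individually, you can list the code points of the string. This is a minimal sketch; describeCodePoints is a hypothetical helper name, and the string is written with escapes so that the code does not depend on how the page renders the conjunct.

```javascript
// द्ध spelled out as escapes: DA, VIRAMA, DHA.
const text = "\u0926\u094D\u0927";

// Hypothetical helper: format each code point as U+XXXX.
function describeCodePoints(s) {
  return Array.from(s, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

console.log(text.length);              // 3
console.log(describeCodePoints(text)); // [ 'U+0926', 'U+094D', 'U+0927' ]
```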
What you apparently want to count is “characters” in some other sense, something like “what a speaker of a language intuitively sees as one character”. This is a vague and mutable concept. The Unicode Standard Annex UAX #29, Unicode Text Segmentation, tries to analyze the concept, calling it a “grapheme cluster”, and describes some algorithms for working with it.
Unfortunately, when this was written, JavaScript had no built-in tools for recognizing whether a character is, e.g., a combining mark and thus should be regarded as part of a cluster. Newer engines add Unicode property escapes in regular expressions (such as \p{M} with the u flag) and Intl.Segmenter, but older ones do not. However, if you can limit yourself to handling just one writing system, you can probably code the operations manually, referring to possible Unicode characters by their code numbers.
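As a rough illustration of coding it manually for one writing system, here is a sketch for Devanagari only. The function names and the hard-coded ranges are assumptions made for this example (check them against the Unicode charts before relying on them), and the rule "a virama glues the next consonant onto the current cluster" is a simplification, not the UAX #29 algorithm.

```javascript
const VIRAMA = 0x094d;

// Assumption: these ranges cover the Devanagari signs that attach to a
// preceding base (candrabindu, anusvara, visarga, nukta, vowel signs,
// virama, stress marks).
function isDevanagariMark(cp) {
  return (
    (cp >= 0x0900 && cp <= 0x0903) ||
    (cp >= 0x093a && cp <= 0x093c) ||
    (cp >= 0x093e && cp <= 0x094f) ||
    (cp >= 0x0951 && cp <= 0x0957) ||
    (cp >= 0x0962 && cp <= 0x0963)
  );
}

// Assumption: KA..HA plus the precomposed nukta consonants QA..YYA.
function isDevanagariConsonant(cp) {
  return (cp >= 0x0915 && cp <= 0x0939) || (cp >= 0x0958 && cp <= 0x095f);
}

function countDevanagariClusters(s) {
  let count = 0;
  let afterVirama = false; // true when the previous code point was a virama
  for (const ch of s) {
    const cp = ch.codePointAt(0);
    if (isDevanagariMark(cp)) {
      // A mark joins the current cluster; a stray leading mark counts as one.
      if (count === 0) count = 1;
      afterVirama = cp === VIRAMA;
    } else {
      // A consonant right after a virama continues the conjunct;
      // anything else starts a new cluster.
      if (!(afterVirama && isDevanagariConsonant(cp))) count += 1;
      afterVirama = false;
    }
  }
  return count;
}

console.log(countDevanagariClusters("\u0926\u094D\u0927"));                   // 1
console.log(countDevanagariClusters("\u0926\u094D\u0927\u0926\u094D\u0926")); // 2
```

With this rule the two test strings from the question give 1 and 2. Whether that matches how a particular editor moves the caret is a separate matter.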
Moreover, if the intent is to make the count match the way some input editor works (e.g. how the arrow keys move over characters), you would need to know the logic of that editor. It may implement Unicode grapheme clusters in some sense, or something else.
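Engines newer than this answer ship Intl.Segmenter, which segments text into grapheme clusters along the lines of UAX #29. The sketch below assumes a runtime that provides it; countGraphemes and graphemeOffsets are hypothetical helper names. The offsets are in UTF-16 code units, the same unit that selectionStart uses on input elements, which is what you need for caret positions. How a conjunct such as द्ध is segmented depends on the Unicode version built into the engine, so no expected count is given for it, and the result may still differ from how a given editor moves its caret.

```javascript
// Count grapheme clusters; locale is optional (undefined uses the default).
function countGraphemes(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(text)) count += 1;
  return count;
}

// Start offset of each cluster, in UTF-16 code units.
function graphemeOffsets(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text), (segment) => segment.index);
}

// "e" followed by a combining acute accent forms one cluster.
console.log(countGraphemes("e\u0301"));    // 1
console.log(graphemeOffsets("ae\u0301b")); // [ 0, 1, 3 ]

// Engine-dependent: older Unicode data splits the conjuncts, newer data keeps them whole.
console.log(countGraphemes("\u0926\u094D\u0927\u0926\u094D\u0926", "hi"));
```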