In my JavaScript code I need to compose a message to server in this format:
<size in bytes>CRLF
<data>CRLF
Example:
3
foo
The data may contain unicode characters. I need to send them as UTF-8.
I'm looking for the most cross-browser way to calculate the length of the string in bytes in JavaScript.
I've tried this to compose my payload:
return unescape(encodeURIComponent(str)).length + "\n" + str + "\n"
But it does not give me accurate results in older browsers (or maybe the strings in those browsers are UTF-16?).
Any clues?
Update:
Example: the length in bytes of the string ЭЭХ! Naïve? in UTF-8 is 15 bytes, but some browsers report 23 bytes instead.
Years passed and nowadays you can do it natively:
(new TextEncoder().encode('foo')).length
Note that it's not supported by IE (you may use a polyfill for that).
MDN documentation
Standard specifications
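For the message format described in the question, a minimal sketch could look like this (the buildMessage helper name and the literal "\r\n" for CRLF are my own assumptions, not part of the original answer):

// Sketch: frame a message as <byte length>CRLF<data>CRLF, counting bytes as UTF-8.
function buildMessage(str) {
  var byteLength = new TextEncoder().encode(str).length;
  return byteLength + "\r\n" + str + "\r\n";
}

buildMessage("foo");         // "3\r\nfoo\r\n"
buildMessage("ЭЭХ! Naïve?"); // "15\r\nЭЭХ! Naïve?\r\n"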
There is no way to do it in JavaScript natively. (See Riccardo Galli's answer for a modern approach.)
The following is kept for historical reference, or for environments where the TextEncoder API is still unavailable.
If you know the character encoding, you can calculate it yourself. encodeURIComponent assumes UTF-8 as the character encoding, so if you need that encoding, you can do:
function lengthInUtf8Bytes(str) {
  // Matches only the 10.. bytes that are non-initial characters in a multi-byte sequence.
  var m = encodeURIComponent(str).match(/%[89ABab]/g);
  return str.length + (m ? m.length : 0);
}
This should work because of the way UTF-8 encodes multi-byte sequences. The first encoded byte always starts with either a high bit of zero for a single byte sequence, or a byte whose first hex digit is C, D, E, or F. The second and subsequent bytes are the ones whose first two bits are 10. Those are the extra bytes you want to count in UTF-8.
The table in Wikipedia makes it clearer:
Bits  Last code point  Byte 1    Byte 2    Byte 3
 7    U+007F           0xxxxxxx
11    U+07FF           110xxxxx  10xxxxxx
16    U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
...
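As a quick illustration (my own example, not from the original answer), here is how the counting plays out for a two-byte character:

encodeURIComponent("é");          // "%C3%A9" – two bytes in UTF-8
// "%C3" starts with C and is ignored; "%A9" is a continuation byte and is counted.
lengthInUtf8Bytes("é");           // 1 (str.length) + 1 (continuation byte) = 2
lengthInUtf8Bytes("ЭЭХ! Naïve?"); // 15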
If instead you need to understand the page encoding, you can use this trick:
function lengthInPageEncoding(s) {
  var a = document.createElement('A');
  a.href = '#' + s;
  var sEncoded = a.href;
  sEncoded = sEncoded.substring(sEncoded.indexOf('#') + 1);
  // Match hex escapes case-insensitively, since browsers emit uppercase hex.
  var m = sEncoded.match(/%[0-9A-Fa-f]{2}/g);
  return sEncoded.length - (m ? m.length * 2 : 0);
}
Here is a much faster version, which doesn't use regular expressions, nor encodeURIComponent():
function byteLength(str) {
  // returns the byte length of a UTF-8 encoded string
  var s = str.length;
  for (var i = str.length - 1; i >= 0; i--) {
    var code = str.charCodeAt(i);
    if (code > 0x7f && code <= 0x7ff) s++;
    else if (code > 0x7ff && code <= 0xffff) s += 2;
    if (code >= 0xDC00 && code <= 0xDFFF) i--; // trail surrogate
  }
  return s;
}
Here is a performance comparison.
It simply computes the UTF-8 length of each Unicode code point returned by charCodeAt() (based on Wikipedia's descriptions of UTF-8 and of UTF-16 surrogate characters).
It follows RFC3629 (where UTF-8 characters are at most 4-bytes long).
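As a quick sanity check (my own addition, not part of the original answer), it agrees with TextEncoder on a string containing an astral (surrogate-pair) character:

var s = "a€😀";                      // 1 + 3 + 4 bytes in UTF-8
byteLength(s);                       // 8
new TextEncoder().encode(s).length;  // 8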
For simple UTF-8 encoding, with slightly better compatibility than TextEncoder, Blob does the trick. It won't work in very old browsers though.
new Blob(["😀"]).size; // -> 4
Another very simple approach, using Buffer (Node.js only):
Buffer.byteLength(string, 'utf8')
Buffer.from(string).length
This function will return the byte size of any UTF-8 string you pass to it.
function byteCount(s) {
  return encodeURI(s).split(/%..|./).length - 1;
}
Source
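To see why this works (my own explanation, not from the original answer): encodeURI turns every non-ASCII byte into a %XX escape and leaves ASCII characters as single characters, and the regex /%..|./ consumes either one escape or one plain character per byte, so the split produces one more piece than there are bytes:

encodeURI("Naïve");               // "Na%C3%AFve"
"Na%C3%AFve".split(/%..|./);      // ["", "", "", "", "", "", ""] – 7 pieces, i.e. 6 bytes
byteCount("ЭЭХ! Naïve?");         // 15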
I compared some of the methods suggested here in Firefox for speed.
The string I used contained the following characters: 😀œ´®†¥¨ˆøπ¬˚∆˙©ƒ∂ßåΩ≈ç√∫˜µ≤
All results are averages of 3 runs each. Times are in milliseconds. Note that all URIEncoding methods behaved similarly and had extreme results, so I only included one.
While there are some fluctuations based on the size of the string, the two charCode methods (lovasoa's and fuweichin's) perform similarly and are the fastest overall, with fuweichin's version the faster of the two. The Blob and TextEncoder methods performed similarly to each other. Generally the charCode methods were about 75% faster than the Blob and TextEncoder methods. The URIEncoding method was basically unacceptable.
Here are the results I got:
Size 6.4 * 10^6 bytes:
Lauri Oherd – URIEncoding: 6400000 et: 796
lovasoa – charCode: 6400000 et: 15
fuweichin – charCode2: 6400000 et: 16
simap – Blob: 6400000 et: 26
Riccardo Galli – TextEncoder: 6400000 et: 23
Size 19.2 * 10^6 bytes: Blob does kind of a weird thing here.
Lauri Oherd – URIEncoding: 19200000 et: 2322
lovasoa – charCode: 19200000 et: 42
fuweichin – charCode2: 19200000 et: 45
simap – Blob: 19200000 et: 169
Riccardo Galli – TextEncoder: 19200000 et: 70
Size 64 * 10^6 bytes:
Lauri Oherd – URIEncoding: 64000000 et: 12565
lovasoa – charCode: 64000000 et: 138
fuweichin – charCode2: 64000000 et: 133
simap – Blob: 64000000 et: 231
Riccardo Galli – TextEncoder: 64000000 et: 211
Size 192 * 10^6 bytes: The URIEncoding method freezes the browser at this point.
lovasoa – charCode: 192000000 et: 754
fuweichin – charCode2: 192000000 et: 480
simap – Blob: 192000000 et: 701
Riccardo Galli – TextEncoder: 192000000 et: 654
Size 640 * 10^6 bytes:
lovasoa – charCode: 640000000 et: 2417
fuweichin – charCode2: 640000000 et: 1602
simap – Blob: 640000000 et: 2492
Riccardo Galli – TextEncoder: 640000000 et: 2338
Size 1280 * 10^6 bytes: Blob & TextEncoder methods are starting to hit the wall here.
lovasoa – charCode: 1280000000 et: 4780
fuweichin – charCode2: 1280000000 et: 3177
simap – Blob: 1280000000 et: 6588
Riccardo Galli – TextEncoder: 1280000000 et: 5074
Size 1920 * 10^6 bytes:
lovasoa – charCode: 1920000000 et: 7465
fuweichin – charCode2: 1920000000 et: 4968
JavaScript error: file:///Users/xxx/Desktop/test.html, line 74: NS_ERROR_OUT_OF_MEMORY:
Here is the code:
function byteLengthURIEncoding(str) {
  return encodeURI(str).split(/%..|./).length - 1;
}

function byteLengthCharCode(str) {
  // returns the byte length of a UTF-8 encoded string
  var s = str.length;
  for (var i = str.length - 1; i >= 0; i--) {
    var code = str.charCodeAt(i);
    if (code > 0x7f && code <= 0x7ff) s++;
    else if (code > 0x7ff && code <= 0xffff) s += 2;
    if (code >= 0xDC00 && code <= 0xDFFF) i--; // trail surrogate
  }
  return s;
}

function byteLengthCharCode2(s) {
  // assuming the String is UCS-2 (aka UTF-16) encoded
  var n = 0;
  for (var i = 0, l = s.length; i < l; i++) {
    var hi = s.charCodeAt(i);
    if (hi < 0x0080) {        // [0x0000, 0x007F]
      n += 1;
    } else if (hi < 0x0800) { // [0x0080, 0x07FF]
      n += 2;
    } else if (hi < 0xD800) { // [0x0800, 0xD7FF]
      n += 3;
    } else if (hi < 0xDC00) { // [0xD800, 0xDBFF]
      var lo = s.charCodeAt(++i);
      if (i < l && lo >= 0xDC00 && lo <= 0xDFFF) { // followed by [0xDC00, 0xDFFF]
        n += 4;
      } else {
        throw new Error("UCS-2 String malformed");
      }
    } else if (hi < 0xE000) { // [0xDC00, 0xDFFF]
      throw new Error("UCS-2 String malformed");
    } else {                  // [0xE000, 0xFFFF]
      n += 3;
    }
  }
  return n;
}

function byteLengthBlob(str) {
  return new Blob([str]).size;
}

function byteLengthTE(str) {
  return (new TextEncoder().encode(str)).length;
}

var sample = "😀œ´®†¥¨ˆøπ¬˚∆˙©ƒ∂ßåΩ≈ç√∫˜µ≤i";
var string = "";
// Adjust multiplier to change length of string.
let mult = 1000000;
for (var i = 0; i < mult; i++) {
  string += sample;
}

let t0;
try {
  t0 = Date.now();
  console.log("Lauri Oherd – URIEncoding: " + byteLengthURIEncoding(string) + " et: " + (Date.now() - t0));
} catch (e) {}
t0 = Date.now();
console.log("lovasoa – charCode: " + byteLengthCharCode(string) + " et: " + (Date.now() - t0));
t0 = Date.now();
console.log("fuweichin – charCode2: " + byteLengthCharCode2(string) + " et: " + (Date.now() - t0));
t0 = Date.now();
console.log("simap – Blob: " + byteLengthBlob(string) + " et: " + (Date.now() - t0));
t0 = Date.now();
console.log("Riccardo Galli – TextEncoder: " + byteLengthTE(string) + " et: " + (Date.now() - t0));
Took me a while to find a solution for React Native so I'll put it here:
First install the buffer package:
npm install --save buffer
Then use the Node method:
const { Buffer } = require('buffer');
const length = Buffer.byteLength(string, 'utf-8');
Actually, I figured out what's wrong. For the code to work, the page <head> should have this tag:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Or, as suggested in the comments, if the server sends an HTTP Content-Type header declaring the charset, it should work as well.
Then results from different browsers are consistent.
Here is an example:
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>mini string length test</title>
  </head>
  <body>
    <script type="text/javascript">
      document.write('<div style="font-size:100px">'
        + (unescape(encodeURIComponent("ЭЭХ! Naïve?")).length) + '</div>'
      );
    </script>
  </body>
</html>
Note: I suspect that specifying any (accurate) encoding would fix the encoding problem. It is just a coincidence that I need UTF-8.
In NodeJS, Buffer.byteLength is a method specifically for this purpose:
let strLengthInBytes = Buffer.byteLength(str); // str is UTF-8
Note that by default the method assumes the string is in UTF-8 encoding. If a different encoding is required, pass it as the second argument.
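As a quick illustration (my own example, not part of the original answer), note how this differs from the UTF-16 code-unit count returned by .length:

const str = 'ЭЭХ! Naïve?';
str.length;                        // 11 UTF-16 code units
Buffer.byteLength(str);            // 15 bytes as UTF-8 (the default)
Buffer.byteLength(str, 'utf16le'); // 22 bytes as UTF-16LE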
Here is an independent and efficient method to count UTF-8 bytes of a string.
// count UTF-8 bytes of a string
function byteLengthOf(s) {
  // assuming the String is UCS-2 (aka UTF-16) encoded
  var n = 0;
  for (var i = 0, l = s.length; i < l; i++) {
    var hi = s.charCodeAt(i);
    if (hi < 0x0080) {        // [0x0000, 0x007F]
      n += 1;
    } else if (hi < 0x0800) { // [0x0080, 0x07FF]
      n += 2;
    } else if (hi < 0xD800) { // [0x0800, 0xD7FF]
      n += 3;
    } else if (hi < 0xDC00) { // [0xD800, 0xDBFF]
      var lo = s.charCodeAt(++i);
      if (i < l && lo >= 0xDC00 && lo <= 0xDFFF) { // followed by [0xDC00, 0xDFFF]
        n += 4;
      } else {
        throw new Error("UCS-2 String malformed");
      }
    } else if (hi < 0xE000) { // [0xDC00, 0xDFFF]
      throw new Error("UCS-2 String malformed");
    } else {                  // [0xE000, 0xFFFF]
      n += 3;
    }
  }
  return n;
}
var s="\u0000\u007F\u07FF\uD7FF\uDBFF\uDFFF\uFFFF";
console.log("expect byteLengthOf(s) to be 14, actually it is %s.",byteLengthOf(s));
Note that the method may throw an error if an input string is malformed UCS-2.
This would work for BMP and SIP/SMP characters.
String.prototype.lengthInUtf8 = function() {
  // ASCII characters take 1 byte each.
  var asciiMatches = this.match(/[\u0000-\u007f]/g);
  var asciiLength = asciiMatches ? asciiMatches.length : 0;
  // For the remaining characters, each %XX escape produced by encodeURI is one byte.
  var nonAscii = this.replace(/[\u0000-\u007f]/g, '');
  var multiByteMatches = encodeURI(nonAscii).match(/%/g);
  var multiByteLength = multiByteMatches ? multiByteMatches.length : 0;
  return asciiLength + multiByteLength;
};
'test'.lengthInUtf8();
// returns 4
'\u{2f894}'.lengthInUtf8();
// returns 4
'سلام علیکم'.lengthInUtf8();
// returns 19, each Arabic/Persian alphabet character takes 2 bytes.
'你好,JavaScript 世界'.lengthInUtf8();
// returns 26, each Chinese character/punctuation takes 3 bytes.