Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why '===' is slower than comparing char by char when comparing two string in Nodejs

I found that in Nodejs comparing two strings by comparing every char of them is faster than using the statement 'str1 === str2'. What is the reason for this? And in browsers, it's just opposite.

Here is the code that I had tried, the two long strings are equal.Node version is v8.11.3

function createConstantStr(len) {
  let str = "";
  for (let i = 0; i < len; i++) {
    str += String.fromCharCode((i % 54) + 68);
  }

  return str;
}

let str = createConstantStr(1000000);
let str2 = createConstantStr(1000000);

console.time('equal')
console.log(str === str2);
console.timeEnd('equal')

console.time('equal by char')
let flag = true;
for (let i = 0; i < str.length; i++) {
  if (str[i] !== str2[i]) {
    flag = false;
    break;
  }
}

console.log(flag);
console.timeEnd('equal by char');
like image 496
YDSS Avatar asked May 01 '19 06:05

YDSS


2 Answers

It's been already pointed out to you that if you flip your two tests around then comparing with === will be faster than comparing char by char. The explanations you've gotten so far as to why have not precisely circumscribed why that is. There are a few issues that affect your results.

The first console.log call is expensive

If I try this:

console.time("a");
console.log(1 + 2);
console.timeEnd("a");

console.time("b");
console.log("foo");
console.timeEnd("b");

I get something like:

3
a: 3.864ms
foo
b: 0.050ms

If I flip code around so that I have this:

console.time("b");
console.log("foo");
console.timeEnd("b");

console.time("a");
console.log(1 + 2);
console.timeEnd("a");

Then I get something like this:

foo
b: 3.538ms
3
a: 0.330ms

If I modify the code by adding a console.log before any timing is taken, like this:

console.log("start");

console.time("a");
console.log(1 + 2);
console.timeEnd("a");

console.time("b");
console.log("foo");
console.timeEnd("b");

Then I get something like:

start
3
a: 0.422ms
foo
b: 0.027ms

By putting a console.log before starting the timings, I'm excluding the initial cost of calling console.log from the timings.

The way you have set up your test, the first console.log call is done by which ever of the === or char-by-char test comes first and this the cost of the first console.log call is added to that test. Whichever test comes second does not bear that cost. Ultimately, for a test like this, I'd rather move console.log outside the region that is being timed. For instance the first timed region could be written like this:

console.time('equal');
const result1 = str === str2;
console.timeEnd('equal');
console.log(result1);

Storing the result in result1 and then using console.log(result1) outside the timed region ensures that you can see the result while at the same time not counting the cost incurred by console.log.

Whichever of your tests comes first bears the cost of flattening the string trees internally created by v8

Node uses the v8 JavaScript engine for running your JavaScript. v8 implements a string in multiple ways. objects.h shows in a comment the class hierarchy that v8 supports. Here is the section relevant to strings:

//       - String
//         - SeqString
//           - SeqOneByteString
//           - SeqTwoByteString
//         - SlicedString
//         - ConsString
//         - ThinString
//         - ExternalString
//           - ExternalOneByteString
//           - ExternalTwoByteString
//         - InternalizedString
//           - SeqInternalizedString
//             - SeqOneByteInternalizedString
//             - SeqTwoByteInternalizedString
//           - ConsInternalizedString
//           - ExternalInternalizedString
//             - ExternalOneByteInternalizedString
//             - ExternalTwoByteInternalizedString

There are two classes important for our discussion: SeqString and ConsString. They differ in how they store the string in memory. The SeqString class is a straightforward implementation: the string is simply an array of characters. (Actually SeqString itself is abstract. The real classes are SeqOneByteString and SeqTwoByteString but that's not important here.) ConsString however stores a string as a binary tree. A ConcString has a first field and a second field which are pointers to other strings.

Consider this code:

let str = "";
for (let i = 0; i < 10; ++i) {
  str += i;
}
console.log(str);

If v8 used SeqString to implement the code above, then:

  • At iteration 0, it would have to allocate a new string of size 1, copy to it the old value of str ("") and append to that "0" and set str to the new string ("0").

  • At iteration 1, it would have to allocate a new string of size 2, copy to it the old value of str ("0") and append to that "1") and set str to the new string ("01").

  • ...

  • At iteration 9, it would have to allocate a new string of size 10, copy to it the old value of str ("012345678") and append to that "9" and set str to the new string ("0123456789").

The total number of characters copied for the 10 steps is 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 = 55 characters. 55 characters moved around for a string that in the end contains 10 characters.

Instead v8 actually uses ConsString like this:

  • At iteration 0, allocate a new ConcString with first set to the old value of str, and second set to i (as a string) 0, and set str to this new ConcString that was just allocated.

  • At iteration 1, allocate a new ConcString with first set to the old value of str, and second set to "1", and set str to this new ConcString that was just allocated.

  • ...

  • At iteration 9, allocate a new ConcString with first set to the old value of str, and second set to "9".

If we represent each ConcString as (<first>, <second>) where <first> is the contents of its first field and <second> is the content of the second field, then the end result is this:

(((((((((("", "0"), "1"), "2"), "3"), "4"), "5"), "6"), "7"), "8"), "9")

By doing things this way v8 avoids having to copy strings over and over again from step to step. Each step is just one allocation and adjusting a couple of pointers. While storing strings as a tree helps make concatenations fast, it has a downside in that other operations become slower. v8 mitigates this by flattening ConsString trees. After flattening the example above, it becomes:

("0123456789", "")

Note that when a ConsString is flattened, this very ConsString object is mutated. (From the standpoint of the JS code, the string remains the same. Only its internal v8 representation has changed.) It is easier to compare flattened ConsString trees, and indeed this is exactly what v8 does (ref):

bool String::Equals(Isolate* isolate, Handle<String> one, Handle<String> two) {
  if (one.is_identical_to(two)) return true;
  if (one->IsInternalizedString() && two->IsInternalizedString()) {
    return false;
  }
  return SlowEquals(isolate, one, two);
}

The strings we're talking about are not internalized so SlowEquals is called (ref):

bool String::SlowEquals(Isolate* isolate, Handle<String> one,
                        Handle<String> two) {
[... some shortcuts are attempted ...]
  one = String::Flatten(isolate, one);
  two = String::Flatten(isolate, two);

I've shown here that comparing strings for equality flattens them internally, but calls to String::Flatten are found in many other places. Both of your tests end up flattening the strings, through different means.

For your code, the upshot is this:

  1. Your createConstantStr creates strings which are internally stored as a ConsString. So str and str2 are ConsString objects, as far as v8 is concerned.

  2. The first test you run causes str and str2 to be flattened so: a) this test has to carry the cost of flattening the strings, b) the second test benefits from working with ConcString objects that are already flattened. (Remember, when a ConcString object is flattened, this very object is mutated. So if it is accessed again later, it is already flattened.)

like image 92
Louis Avatar answered Oct 18 '22 00:10

Louis


I reversed the comparison operation and looks like 0 ms (sometimes 1 ms) on === (firefox). So probably something to do with compiler internals trying to optimize. Something like, hey the strings are the same in the second comparison operation and I've already compared them. So I'll re-use the result.

This youtube video explains best.

function createConstantStr(len) {
  let str = "";
  for (let i = 0; i < len; i++) {
    str += String.fromCharCode((i % 54) + 68);
  }

  return str;
}

let str = createConstantStr(1000000);
let str2 = createConstantStr(1000000);

console.time('equal by char')
let flag = true;
for (let i = 0; i < str.length; i++) {
  if (str[i] !== str2[i]) {
    flag = false;
    break;
  }
}

console.log(flag);
console.timeEnd('equal by char');

console.time('equal')
console.log(str === str2);
console.timeEnd('equal')
like image 4
1565986223 Avatar answered Oct 18 '22 00:10

1565986223