
Why is DFS slower in one tree and faster in the other?

UPDATE: It turns out there was a bug in the parser that generated the trees. More in the final edit.

Let T be a binary tree such that every internal node has exactly two children. For this tree, we want to code a function that, for every node v in T, finds the number of nodes in the subtree rooted at v.
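In other words, the quantity we want satisfies the usual recurrence: size(v) = 1 if v is a leaf, and size(v) = 1 + size(left(v)) + size(right(v)) otherwise.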

Example

Input

[figure: the input tree]

Desired output

[figure: the same tree with the subtree sizes shown in red]

The red numbers are the values we want to compute. The nodes of the tree are stored in an array, call it TreeArray, following the preorder layout.

For the example above, TreeArray will contain the following objects:

10, 11, 0, 12, 13, 2, 7, 3, 14, 1, 15, 16, 4, 8, 17, 18, 5, 9, 6
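
To make the layout concrete: in a preorder array a node is immediately followed by its entire left subtree, and the right subtree starts right after that, so the child positions can be derived from the node's own position. A minimal sketch of the index arithmetic (lposOf and rposOf are illustrative helpers, not part of the actual code; the size of the left subtree is assumed to be already known):

//the left child sits right after its parent in the preorder array
int lposOf(int pos){ return pos + 1; }

//the right child starts right after the entire left subtree
int rposOf(int pos, int leftSubtreeSize){ return pos + 1 + leftSubtreeSize; }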

A node of the tree is described by the following struct:

struct tree_node{

    long long int id; //id of the node, randomly generated
    int numChildren; //number of children, it is 2 but for the leafs it's 0
    int size; //size of the subtree rooted at the current node,
    // what we want to compute

    int pos; //position in TreeArray where the node is stored
    int lpos; //position of the left child
    int rpos; //position of the right child

    tree_node(){
        id = -1;
        size = 1;
        pos = lpos = rpos = -1;
        numChildren = 0;
    }

};

The function to compute all the size values is the following:

void testCache(int cur){

    if(treeArray[cur].numChildren == 0){
        treeArray[cur].size = 1;
        return;
    }

    testCache(treeArray[cur].lpos);
    testCache(treeArray[cur].rpos);

    treeArray[cur].size = treeArray[treeArray[cur].lpos].size +
    treeArray[treeArray[cur].rpos].size + 1;

}

I would like to understand why this function is faster when T looks like this (almost like a left going chain):

[figure: a tree that is almost a left going chain]

and slower when T looks like this (almost like a right going chain):

[figure: a tree that is almost a right going chain]

The following experiments were run on an Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz with 8 GB of RAM (L1 cache 256 KB, L2 cache 1 MB, L3 cache 6 MB).

Each dot in the graphs is the result of the following for loop (the parameters are defined by the axis):

for (int i = 0; i < 100; i++) {
    testCache(0);
}

[figure: running time (seconds) vs n for left going and right going trees]

n is the total number of nodes, and the time is measured in seconds. As we can see, as n grows the function becomes much faster when the tree looks like a left going chain, even though the number of nodes is exactly the same in both cases.

Now let us try to find where the bottleneck is. I used the PAPI library to count interesting hardware counters.
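
For completeness, here is roughly how such counters can be read with PAPI's low-level API (a minimal sketch, not my exact measurement code; PAPI_L2_TCA and PAPI_L2_TCM are the standard preset events for L2 accesses and misses, and testCache/treeArray are the ones from above):

#include <papi.h>

long long values[2];
int eventSet = PAPI_NULL;

//initialize the library and build an event set with two preset events
PAPI_library_init(PAPI_VER_CURRENT);
PAPI_create_eventset(&eventSet);
PAPI_add_event(eventSet, PAPI_L2_TCA); //total L2 cache accesses
PAPI_add_event(eventSet, PAPI_L2_TCM); //total L2 cache misses

//measure the experiment loop
PAPI_start(eventSet);
for (int i = 0; i < 100; i++) {
    testCache(0);
}
PAPI_stop(eventSet, values); //values[0] = L2 accesses, values[1] = L2 misses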

The first counter is the instruction count: how many instructions do we actually execute, and is there a difference between the two tree shapes?

[figure: instruction count vs n for both tree shapes]

The difference is not significant. It looks like for large inputs the left going chain requires fewer instructions, but the difference is so small that I think it is safe to assume both shapes require the same number of instructions.

Since the tree is stored in a nice preorder layout inside treeArray, it makes sense to look at what is happening in the cache. Unfortunately my machine does not provide counters for the L1 cache, but it does for L2 and L3.

Let's look at the accesses to the L2 cache. An access to L2 happens when we miss in L1, so this is also an indirect counter for L1 misses.

[figure: L2 cache accesses vs n for both tree shapes]

As we can see, the right going tree incurs fewer L1 misses, so it seems to use the cache more efficiently.

[figure: L2 cache misses vs n for both tree shapes]

Same for L2 misses: the right going tree seems to be more efficient. Still nothing to indicate why the right going trees are so much slower. Let's look at L3.

[figure: L3 cache counter vs n for both tree shapes]

In L3 things explode for the right going trees. So the problem seems to be in the L3 cache. Unfortunately I cannot explain the reason behind this behavior. Why do things get messed up in the L3 cache for the right going trees?

Here is the entire code together with the experiment:

#include <iostream>
#include <fstream>
#define BILLION  1000000000LL

using namespace std;

/*
 *
 * Timing functions
 *
 */

timespec startT, endT;

void startTimer(){
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &startT);
}

double endTimer(){
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &endT);
    return endT.tv_sec * BILLION + endT.tv_nsec - (startT.tv_sec * BILLION + startT.tv_nsec);
}

/*
 *
 * tree node
 *
 */

//this struct is used for creating the first tree after reading it from the external file, for this we need left and child pointers

struct tree_node_temp{

    long long int id; //id of the node, randomly generated
    int numChildren; //number of children, it is 2 but for the leafs it's 0
    int size; //size of the subtree rooted at the current node
    tree_node_temp *leftChild;
    tree_node_temp *rightChild;

    tree_node_temp(){
        id = -1;
        size = 1;
        leftChild = nullptr;
        rightChild = nullptr;
        numChildren = 0;
    }

};

struct tree_node{

    long long int id; //id of the node, randomly generated
    int numChildren; //number of children, it is 2 but for the leafs it's 0
    int size; //size of the subtree rooted at the current node

    int pos; //position in TreeArray where the node is stored
    int lpos; //position of the left child
    int rpos; //position of the right child

    tree_node(){
        id = -1;
        pos = lpos = rpos = -1;
        numChildren = 0;
    }

};

/*
 *
 * Tree parser. The input is a file containing the tree in the newick format.
 *
 */

string treeNewickStr; //string storing the newick format of a tree that we read from a file
int treeCurSTRindex; //index to the current position we are in while reading the newick string
int treeNumLeafs; //number of leafs in current tree
tree_node ** treeArrayReferences; //stack of references to free node objects
tree_node *treeArray; //array of node objects
int treeStackReferencesTop; //the top index to the references stack
int curpos; //used to find pos,lpos and rpos when creating the pre order layout tree

//helper function for readNewick
tree_node_temp* readNewickHelper() {

    int i;
    if(treeCurSTRindex == treeNewickStr.size())
        return nullptr;

    tree_node_temp * leftChild;
    tree_node_temp * rightChild;

    if(treeNewickStr[treeCurSTRindex] == '('){
        //create a left child
        treeCurSTRindex++;
        leftChild = readNewickHelper();
    }

    if(treeNewickStr[treeCurSTRindex] == ','){
        //create a right child
        treeCurSTRindex++;
        rightChild = readNewickHelper();
    }

    if(treeNewickStr[treeCurSTRindex] == ')' || treeNewickStr[treeCurSTRindex] == ';'){
        treeCurSTRindex++;
        tree_node_temp * cur = new tree_node_temp();
        cur->numChildren = 2;
        cur->leftChild = leftChild;
        cur->rightChild = rightChild;
        cur->size = 1 + leftChild->size + rightChild->size;
        return cur;
    }

    //we are about to read a label, keep reading until we read a "," ")" or "(" (we assume that the newick string has the right format)
    i = 0;
    char treeLabel[20]; //buffer used for the label
    while(treeNewickStr[treeCurSTRindex]!=',' && treeNewickStr[treeCurSTRindex]!='(' && treeNewickStr[treeCurSTRindex]!=')'){
        treeLabel[i] = treeNewickStr[treeCurSTRindex];
        treeCurSTRindex++;
        i++;
    }

    treeLabel[i] = '\0';
    tree_node_temp * cur = new tree_node_temp();
    cur->numChildren = 0;
    cur->id = atoi(treeLabel)-1;
    treeNumLeafs++;

    return cur;
}

//create the pre order tree, curRoot in the first call points to the root of the first tree that was given to us by the parser
void treeInit(tree_node_temp * curRoot){

    tree_node * curFinalRoot = treeArrayReferences[curpos];

    curFinalRoot->pos = curpos;

    if(curRoot->numChildren == 0) {
        curFinalRoot->id = curRoot->id;
        return;
    }

    //add left child
    tree_node * cnode = treeArrayReferences[treeStackReferencesTop];
    curFinalRoot->lpos = curpos + 1;
    curpos = curpos + 1;
    treeStackReferencesTop++;
    cnode->id = curRoot->leftChild->id;
    treeInit(curRoot->leftChild);

    //add right child
    curFinalRoot->rpos = curpos + 1;
    curpos = curpos + 1;
    cnode = treeArrayReferences[treeStackReferencesTop];
    treeStackReferencesTop++;
    cnode->id = curRoot->rightChild->id;
    treeInit(curRoot->rightChild);

    curFinalRoot->id = curRoot->id;
    curFinalRoot->numChildren = 2;
    curFinalRoot->size = curRoot->size;

}

//the ids of the leafs are deteremined by the newick file, for the internal nodes we just incrementally give the id determined by the dfs traversal
void updateInternalNodeIDs(int cur){

    tree_node* curNode = treeArrayReferences[cur];

    if(curNode->numChildren == 0){
        return;
    }
    curNode->id = treeNumLeafs++;
    updateInternalNodeIDs(curNode->lpos);
    updateInternalNodeIDs(curNode->rpos);

}

//frees the memory of the first tree generated by the parser
void treeFreeMemory(tree_node_temp* cur){

    if(cur->numChildren == 0){
        delete cur;
        return;
    }
    treeFreeMemory(cur->leftChild);
    treeFreeMemory(cur->rightChild);

    delete cur;

}

//reads the tree stored in "file" under the newick format and creates it in the main memory. The output (what the function returns) is a pointer to the root of the tree.
//this tree is scattered anywhere in the memory.
tree_node* readNewick(string& file){

    treeCurSTRindex = -1;
    treeNewickStr = "";
    treeNumLeafs = 0;

    ifstream treeFin;

    treeFin.open(file, ios_base::in);
    //read the newick format of the tree and store it in a string
    treeFin>>treeNewickStr;
    //initialize index for reading the string
    treeCurSTRindex = 0;
    //create the tree in main memory
    tree_node_temp* root = readNewickHelper();

    //store the tree in an array following the pre order layout
    treeArray = new tree_node[root->size];
    treeArrayReferences = new tree_node*[root->size];
    int i;
    for(i=0;i<root->size;i++)
        treeArrayReferences[i] = &treeArray[i];
    treeStackReferencesTop = 0;

    tree_node* finalRoot = treeArrayReferences[treeStackReferencesTop];
    curpos = treeStackReferencesTop;
    treeStackReferencesTop++;
    finalRoot->id = root->id;
    treeInit(root);

    //update the internal node ids (the leaf ids are defined by the ids stored in the newick string)
    updateInternalNodeIDs(0);
    //close the file
    treeFin.close();

    //free the memory of initial tree
    treeFreeMemory(root);
    //return the pre order tree
    return finalRoot;

}

/*
 *
 *
 * DOT FORMAT OUTPUT --- BEGIN
 *
 *
 */

void treeBstPrintDotAux(tree_node* node, ofstream& treeFout)
{
    if(node->numChildren == 0) return;

    treeFout<<"    "<<node->id<<" -> "<<treeArrayReferences[node->lpos]->id<<";\n";
    treeBstPrintDotAux(treeArrayReferences[node->lpos], treeFout);

    treeFout<<"    "<<node->id<<" -> "<<treeArrayReferences[node->rpos]->id<<";\n";
    treeBstPrintDotAux(treeArrayReferences[node->rpos], treeFout);

}

void treePrintDotHelper(tree_node* cur, ofstream& treeFout){
    treeFout<<"digraph BST {\n";
    treeFout<<"    node [fontname=\"Arial\"];\n";

    if(cur == nullptr){
        treeFout<<"\n";
    }
    else if(cur->numChildren == 0){
        treeFout<<"    "<<cur->id<<";\n";
    }
    else{
        treeBstPrintDotAux(cur, treeFout);
    }

    treeFout<<"}\n";
}

void treePrintDot(string& file, tree_node* root){

    ofstream treeFout;
    treeFout.open(file, ios_base::out);
    treePrintDotHelper(root, treeFout);
    treeFout.close();

}

/*
 *
 *
 * DOT FORMAT OUTPUT --- END
 *
 *
 */

/*
 * experiments
 *
 */

tree_node* T;
int n;

void testCache(int cur){

    if(treeArray[cur].numChildren == 0){
        treeArray[cur].size = 1;
        return;
    }

    testCache(treeArray[cur].lpos);
    testCache(treeArray[cur].rpos);

    treeArray[cur].size = treeArray[treeArray[cur].lpos].size + treeArray[treeArray[cur].rpos].size + 1;

}


int main(int argc, char* argv[]){

    string Tnewick = argv[1];
    T = readNewick(Tnewick);

    n = T->size;
    double tt;

    startTimer();
    for (int i = 0; i < 100; i++) {
        testCache(0);
    }

    tt = endTimer();
    cout << tt / BILLION << '\t' << T->size;
    cout<<endl;

    return 0;
}

Compile with g++ -O3 -std=c++11 file.cpp and run with ./executable tree.txt, where tree.txt stores the tree in the newick format.

Here is a left going tree with 10^5 leaves

Here is a right going tree with 10^5 leaves
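
For illustration of the shapes only (these are not the exact files linked above), a right going chain with four leaves can be written in newick as (1,(2,(3,4))); while a left going chain with four leaves looks like (((1,2),3),4);.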

The running times I get: ~0.07 seconds for the left going trees, ~0.12 seconds for the right going trees.

I apologize for the long post, but given how narrow the problem seems to be, I couldn't find a better way to describe it.

Thank you in advance!

EDIT:

This is a follow-up edit after MrSmith42's answer. I understand that locality plays a very big role, but I am not sure I see why that is the case here.

For the two example trees above let us see how we access the memory over time.

For the left going tree:

[figure: memory access pattern over time for the left going tree]

For the right going tree:

[figure: memory access pattern over time for the right going tree]

To me it seems like in both cases we have local access patterns.
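
To make this concrete, take exact chains with 4 leaves and write down the indices touched while the recursion unwinds (a hand-worked example, assuming the preorder layout described above). For the left chain the preorder array is v0 v1 v2 a b c d, and the size computations read the index pairs (3,4), then (2,5), then (1,6): the low indices walk down 2, 1, 0 while the high indices walk up 3, ..., 6, i.e. two sequential streams. For the right chain the array is v0 a v1 b v2 c d, and the size computations read (5,6), then (3,4), then (1,2): a single backward scan. On paper both look perfectly local, which is exactly what confused me.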

EDIT:

Here is a plot of the number of conditional branches:

[figure: conditional branch count vs n for both tree shapes]

Here is a plot of the number of branch mispredictions:

[figure: branch misprediction count vs n for both tree shapes]

Here is a left going tree with 10^6 leaves

Here is a right going tree with 10^6 leaves

FINAL EDIT:

I would like to apologize for wasting everyone's time. The parser I was using had a parameter controlling how "left" or "right" going the generated tree should be. It is a floating-point number that has to be close to 0 to make the tree left going and close to 1 to make it right going. However, to make the tree look like a chain it has to be extreme, like 0.000000001 or 0.999999999. For small inputs the tree looked like a chain even for values like 0.0001. I thought this value was small enough and would also produce a chain for larger trees, but as I will show below that is not the case. If you use values like 0.000000001 the parser stops working due to floating point problems.

vadikrobot's answer showed that we have locality issues. Inspired by his experiment, I decided to generalize the access pattern diagram above to see how it behaves not just on the example trees, but on arbitrary trees.

I modified vadikrobot's code to look like this:

void testCache(int cur, FILE *f) {

    if(treeArray[cur].numChildren == 0){
        fprintf(f, "%d\t", tim++);
        fprintf(f, "%d\n", cur);
        treeArray[cur].size = 1;
        return;
    }

    fprintf(f, "%d\t", tim++);
    fprintf(f, "%d\n", cur);
    testCache(treeArray[cur].lpos, f);
    fprintf(f, "%d\t", tim++);
    fprintf(f, "%d\n", cur);
    testCache(treeArray[cur].rpos, f);
    fprintf(f, "%d\t", tim++);
    fprintf(f, "%d\n", cur);
    fprintf(f, "%d\t", tim++);
    fprintf(f, "%d\n", treeArray[cur].lpos);
    fprintf(f, "%d\t", tim++);
    fprintf(f, "%d\n", treeArray[cur].rpos);
    treeArray[cur].size = treeArray[treeArray[cur].lpos].size +
    treeArray[treeArray[cur].rpos].size + 1;
}

Access patterns generated by the wrong parser

Let's look at a left tree with 10 leaves.

[figure: access pattern for a left tree with 10 leaves]

This looks very nice, as predicted in the diagrams above. (In those diagrams I only forgot that when we compute the size of a node we also access the size field of that node, cur in the source code above.)

Let's look at a left tree with 100 leaves.

[figure: access pattern for a left tree with 100 leaves]

Looks as expected. What about 1000 leaves?

[figure: access pattern for a left tree with 1000 leaves]

This is definitely not expected. There is a small triangle in the top right corner, and the reason is that the tree does not really look like a left going chain; a small subtree hangs off somewhere near the end. The problem becomes even larger with 10^4 leaves.

[figure: access pattern for a left tree with 10^4 leaves]

Let's look at what happens with right trees. With 10 leaves:

[figure: access pattern for a right tree with 10 leaves]

Looks nice; what about 100 leaves?

[figure: access pattern for a right tree with 100 leaves]

Looks good too. This is why I questioned the locality of right trees: to me both seemed, at least in theory, local. Now if you increase the size, something interesting happens.

For 1000 leaves:

[figure: access pattern for a right tree with 1000 leaves]

For 10^4 leaves things get even messier:

[figure: access pattern for a right tree with 10^4 leaves]

Access patterns generated by the correct parser

Instead of using that general parser I created one for this specific question:

#include <iostream>
#include <fstream>

using namespace std;

int main(int argc, char* argv[]){

    if(argc!=4){
        cout<<"type ./executable n{number of leafs} type{l:left going, r:right going} outputFile"<<endl;
        return 0;
    }

    int i;

    int n = atoi(argv[1]);

    if(n <= 2){cout<<"leafs must be at least 3"<<endl; return 0;}

    char c = argv[2][0];

    ofstream fout;
    fout.open(argv[3], ios_base::out);

    if(c == 'r'){

        for(i=0;i<n-1;i++){
            fout<<"("<<i<<",";
        }
        fout<<i;
        for(i=0;i<n;i++){
            fout<<")";
        }
        fout<<";"<<endl;

    }
    else{

        for(i=0;i<n-1;i++){
            fout<<"(";
        }

        fout<<1<<","<<n<<")";

        for(i=n-1;i>1;i--){
            fout<<","<<i<<")";
        }
        fout<<";"<<endl;

    }

    fout.close();

    return 0;
}
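
For example, assuming the generator above was compiled to an executable called generator (the name is arbitrary), ./generator 10000 l left.txt writes a left going chain with 10^4 leaves to left.txt, and ./generator 10000 r right.txt does the same for a right going chain.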

Now the access patterns look as expected.

For left trees with 10^4 leaves:

[figure: access pattern for a correct left going chain with 10^4 leaves]

In the black part we go from a low index to a high index, but the distance between the previous low and the current low is small, and the same holds for the previous and current high. Hence the cache only has to hold two blocks, one for the low indices and one for the high indices, which gives a small number of cache misses.

For right trees with 10^4 leaves:

[figure: access pattern for a correct right going chain with 10^4 leaves]

The original experiments again. This time I could only go up to 10^5 leaves because, as Mysticial noticed, we get a stack overflow due to the height of the trees; this wasn't an issue in the previous experiments since those trees were not as tall as expected.
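
If anyone wants to push past 10^5 leaves, a sketch of an explicit-stack variant of testCache that avoids the deep recursion (and hence the stack overflow) could look like the following. It assumes the same global treeArray as above and is only meant as an illustration, not as a drop-in replacement for the measured code:

#include <vector>
#include <utility>

void testCacheIterative(int root){

    //each stack entry is (node index, have the children already been processed?)
    std::vector<std::pair<int,bool> > st;
    st.push_back(std::make_pair(root, false));

    while(!st.empty()){
        std::pair<int,bool> top = st.back();
        st.pop_back();
        int cur = top.first;

        if(treeArray[cur].numChildren == 0){
            treeArray[cur].size = 1;
        }
        else if(!top.second){
            st.push_back(std::make_pair(cur, true)); //finish cur after its children
            st.push_back(std::make_pair(treeArray[cur].rpos, false));
            st.push_back(std::make_pair(treeArray[cur].lpos, false)); //left child is popped first
        }
        else{
            treeArray[cur].size = treeArray[treeArray[cur].lpos].size +
            treeArray[treeArray[cur].rpos].size + 1;
        }
    }
}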

Time wise the two shapes now seem to perform the same, but cache wise and branch wise they do not: the right trees beat the left trees on branch predictions, and the left trees beat the right trees on cache behavior.

Maybe my PAPI usage was wrong; here is the output from perf:

left trees:

[figure: perf output for the left trees]

right trees:

[figure: perf output for the right trees]

I might have messed something up again, and I apologize for this. I included my attempt here just in case anyone wants to continue investigating.

asked Sep 12 '16 by jsguy




2 Answers

UPDATE:

I plotted the index of the accessed array element over time:

void testCache(int cur, FILE *f) {

   if(treeArray[cur].numChildren == 0){
       fprintf(f, "%d\n", cur);
       treeArray[cur].size = 1;
       return;
   }

   fprintf(f, "%d\n", cur);
   testCache(treeArray[cur].lpos, f);
   fprintf(f, "%d\n", cur);
   testCache(treeArray[cur].rpos, f);

   fprintf(f, "%d\n", treeArray[cur].lpos);
   fprintf(f, "%d\n", treeArray[cur].rpos);
   treeArray[cur].size = treeArray[treeArray[cur].lpos].size + treeArray[treeArray[cur].rpos].size + 1;
}

As a result I plotted 999990 elements of the resulting text file:

[figure: accessed array index over time for the left and right trees]

You can see that for the left tree all elements are accessed locally, but for the right one there is non-uniformity in the accesses.

OLD:

I tried to count the number of memory reads using valgrind. For the right one:

valgrind --tool=callgrind --cache-sim ./a.out right

==11493== I   refs:      427,444,674
==11493== I1  misses:          2,288
==11493== LLi misses:          2,068
==11493== I1  miss rate:        0.00%
==11493== LLi miss rate:        0.00%
==11493==
==11493== D   refs:      213,159,341  (144,095,416 rd + 69,063,925 wr)
==11493== D1  misses:     15,401,346  ( 12,737,497 rd +  2,663,849 wr)
==11493== LLd misses:        329,337  (      7,935 rd +    321,402 wr)
==11493== D1  miss rate:         7.2% (        8.8%   +        3.9%  )
==11493== LLd miss rate:         0.2% (        0.0%   +        0.5%  )
==11493==
==11493== LL refs:        15,403,634  ( 12,739,785 rd +  2,663,849 wr)
==11493== LL misses:         331,405  (     10,003 rd +    321,402 wr)
==11493== LL miss rate:          0.1% (        0.0%   +        0.5%  )

and for the left one:

valgrind --tool=callgrind --cache-sim=yes ./a.out left

==11496== I   refs:      418,204,722
==11496== I1  misses:          2,327
==11496== LLi misses:          2,099
==11496== I1  miss rate:        0.00%
==11496== LLi miss rate:        0.00%
==11496==
==11496== D   refs:      204,114,971  (135,076,947 rd + 69,038,024 wr)
==11496== D1  misses:     19,470,268  ( 12,661,123 rd +  6,809,145 wr)
==11496== LLd misses:        306,948  (      7,935 rd +    299,013 wr)
==11496== D1  miss rate:         9.5% (        9.4%   +        9.9%  )
==11496== LLd miss rate:         0.2% (        0.0%   +        0.4%  )
==11496==
==11496== LL refs:        19,472,595  ( 12,663,450 rd +  6,809,145 wr)
==11496== LL misses:         309,047  (     10,034 rd +    299,013 wr)
==11496== LL miss rate:          0.0% (        0.0%   +        0.4%  )

As you can see, the number of memory reads ('rd') in the 'right' case is bigger than in the 'left' case.

answered Oct 14 '22 by vadikrobot


The cache misses differ because of the location of the nodes in memory. If you access the nodes in the order in which they lie in memory, it is likely that the cache has already loaded them from RAM, because the cache loads whole cache pages (most likely bigger than one of your nodes).

If you access the nodes in a random order (with respect to their position in RAM), or in reverse order, it becomes more likely that the cache has not yet loaded them from RAM.

So the difference is not caused by the structure of your tree, but by the position of the tree nodes in RAM compared to the order in which you want to access them.
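
A rough back-of-the-envelope check supports this: each tree_node holds one long long (8 bytes) plus five ints (20 bytes), which a typical x86-64 compiler pads to 32 bytes, so a 64-byte cache line covers roughly two consecutive nodes of TreeArray (this assumes a 64-byte line and typical padding; it is an estimate, not something measured in the question). A self-contained probe:

#include <iostream>

//same field layout as the question's tree_node, used only to check its footprint
struct tree_node_probe {
    long long int id;
    int numChildren, size, pos, lpos, rpos;
};

int main(){
    std::cout << sizeof(tree_node_probe) << std::endl; //typically prints 32
    return 0;
}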


EDIT (after the access pattern was added to the question):

As you can see in your access pattern graph:
On the "left going tree" the accesses jump from low to high indexes after about half of the accesses, so the second half will likely always lead to cache misses because the distance keeps growing.
On the "right going tree" the second half has at least 2 nodes near each other (in order of access), and with a bit of luck the next two are sometimes on the same cache page as well.

answered Oct 14 '22 by MrSmith42