
Indexing an array with a string (C)

I have an array of unsigned integers, each corresponding to a string of 12 characters drawn from a 4-character alphabet, namely 'A', 'B', 'C', 'D'. Thus the array will contain 4^12 = 16777216 elements. The ordering of the elements in the array is arbitrary; I can choose which one corresponds to each string. So far, I have implemented this simply as follows:

unsigned int my_array[16777216];
char my_string[12];
int index = string_to_index(my_string);

my_array[index] = ...;

string_to_index() simply assigns 2 bits per character: A --> 00, B --> 01, C --> 10, D --> 11. For example, ABCDABCDABCD corresponds to the index 000110110001101100011011 in binary, which is 1776411 in decimal.
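For reference, the 2-bits-per-character packing described above could be sketched as follows. The helper name char_to_code, the STRING_LEN constant, and the -1 error return are assumptions made for illustration; the question does not show the actual body of string_to_index().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { STRING_LEN = 12 };  /* characters per key, as in the question */

/* Hypothetical helper: map 'A'..'D' to 0..3, or -1 for anything else. */
int char_to_code(char c)
{
    switch (c) {
    case 'A': return 0;
    case 'B': return 1;
    case 'C': return 2;
    case 'D': return 3;
    default:  return -1;
    }
}

/* Pack exactly STRING_LEN characters into a 24-bit index, with the first
 * character in the most significant bits. The input does not need to be
 * NUL-terminated. Returns -1 if an invalid character is found (an
 * assumption; the original function's error handling is not shown). */
int32_t string_to_index(const char *s)
{
    assert(s != NULL);
    uint32_t index = 0;
    for (int i = 0; i < STRING_LEN; i++) {
        int code = char_to_code(s[i]);
        if (code < 0)
            return -1;
        index = (index << 2) | (uint32_t)code;
    }
    return (int32_t)index;
}
```

With this sketch, string_to_index("ABCDABCDABCD") yields 1776411, matching the example above.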

However, I know for a fact that each string used to access the array is the previous string shifted left by one character, with a new last character appended. For example, after I access with ABCDABCDABCD, the next access will use BCDABCDABCDA, BCDABCDABCDB, BCDABCDABCDC, or BCDABCDABCDD.
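Under the 2-bits-per-character encoding, this shift property means the next index can be derived from the current one without re-reading the whole string. A minimal sketch, assuming the first character sits in the most significant bits; the names next_index and INDEX_MASK are hypothetical and not part of the original code:

```c
#include <assert.h>
#include <stdint.h>

#define INDEX_MASK 0xFFFFFFu  /* 24 bits = 12 characters x 2 bits */

/* Drop the oldest character (top 2 bits) and append a new one.
 * code is the 2-bit value of the appended character: A=0, B=1, C=2, D=3. */
uint32_t next_index(uint32_t index, unsigned code)
{
    assert(code < 4);
    return ((index << 2) | code) & INDEX_MASK;
}
```

For example, next_index(1776411, 0) gives 7105644, the index of BCDABCDABCDA. The four possible successors of any index always land in four adjacent array elements, although that group is generally far from the current index.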

So my question is: Is there a better way to implement the string_to_index function that takes this last fact into account, so that elements that are accessed consecutively are closer together in the array? I am hoping to improve my caching performance by doing so.

Edit: Maybe I was not very clear. I am looking for a completely different string-to-index correspondence scheme, so that the indexes of ABCDABCDABCD and BCDABCDABCDA are closer.

Cantfindname asked May 23 '14 12:05


1 Answer

If the following assumptions hold for your problem, then the solution you implemented is already the best one.

  1. The rightmost character of the next string is chosen at random, with equal probability for each valid character.
  2. The start of the sequence is not always the same (it is random).

Reason: When I first read your question, I came up with the following tree. For simplicity, I reduced your problem to strings of three characters over only two possible characters, A and B. Note that the leftmost child of the root node (AAA in this case) is always the same as the root itself (AAA), so I am not building that branch further.

                      AAA
                     /  \
                        AAB       
                       /  \         
                     ABA    ABB
                    /  \    /   \ 
                 BAA   BAB  BBA  BBB

In this tree, the children of each node are its possible next strings. To improve caching, you need to traverse this tree breadth-first and store the elements in the array in that same order. For the tree above, we get the following string-to-index mapping.

  • AAA 0
  • AAB 1
  • ABA 2
  • ABB 3
  • BAA 4
  • BAB 5
  • BBA 6
  • BBB 7

Assuming value(A) = 0 and value(B) = 1, index can be calculated as

index = 2^0 * (value(string[2])) +  2^1 * (value(string[1])) + 2^2 * (value(string[0]))
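The claim that the breadth-first order coincides with this positional formula can be checked mechanically. The following is an independent sketch in C, not the script mentioned in this answer. It represents each string by its base-k value, so the children of a state are (state * k + c) mod k^n, and it assumes k^n is small enough to allocate. The function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Returns true if a breadth-first traversal starting from the all-zero
 * string (e.g. AAA) visits the states in exactly the order
 * 0, 1, 2, ..., k^n - 1, where a state is the base-k value of a string
 * of n characters. Returns false otherwise, or if allocation fails. */
bool bfs_order_matches_positional(unsigned k, unsigned n)
{
    assert(k >= 2 && n >= 1);

    size_t total = 1;               /* k^n, the number of distinct strings */
    for (unsigned i = 0; i < n; i++)
        total *= k;

    size_t *queue = malloc(total * sizeof *queue);
    bool *seen = calloc(total, sizeof *seen);
    if (queue == NULL || seen == NULL) {
        free(queue);
        free(seen);
        return false;
    }

    size_t head = 0, tail = 0;
    queue[tail++] = 0;              /* root: the all-zero string */
    seen[0] = true;

    bool ok = true;
    while (head < tail) {
        size_t state = queue[head];
        if (state != head)          /* BFS position must equal the index */
            ok = false;
        head++;
        /* Shift left by one character and append each possible character. */
        for (unsigned c = 0; c < k; c++) {
            size_t child = (state * k + c) % total;
            if (!seen[child]) {
                seen[child] = true;
                queue[tail++] = child;
            }
        }
    }

    ok = ok && (tail == total);     /* every string must have been reached */
    free(queue);
    free(seen);
    return ok;
}
```

Calling bfs_order_matches_positional(2, 3) corresponds to the three-character A/B tree above; other small alphabets and lengths can be tried the same way.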

This is the same solution as the one you are using. I have written a Python script to check this for other combinations too (for example, strings of length 4 with A, B, C as possible characters). Script link

So unless the two assumptions made at the beginning are false, your solution already takes care of cache optimisation.

Ravichandra Sutrave answered Oct 21 '22 12:10