Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sort algorithm for Excel / SharedStrings

In Excel, they 'compress' strings to a numerical mapping (though I'm not sure that the word compress is correct in this case). Here is an example shown below:

enter image description here

While this helps to reduce the overall filesize and memory footprint, how then does Excel do sorting on a string field? Would every single string need to go through the lookup mapping: and if so, wouldn't that greatly increase the cost of / slowing down doing a sort on a string field (what if there were 1M values, 1M key lookups wouldn't be trivial). Two questions on this:

  1. Are shared strings used within the Excel application itself, or only when saving the data?
  2. What would be an example algorithm to sort on the field then? Any language is fine (c, c#, c++, python).
like image 409
David542 Avatar asked Jan 15 '20 19:01

David542


Video Answer


2 Answers

I can't find how exactly Excel stores cells with SharedStringTable elements in-memory at runtime, but storing them as an index of the item in SharedStringTable requires just one extra dereference to access them, assuming that the elements are stored as an array. So my guess is that this is how it is done. That is the simplest way and the only way to make it faster is having runtime representation of SharedStringTable already sorted by elements. In such case sorting by an index is equivalent to sorting by the value. That approach, however, makes the insertion operation costly as when a new string is inserted into the middle of the table all the indexes larger than it should be incremented and the number of such cells in the document can be very large, up to all the cells referring to SharedStringTable.

If the cells contain indexes same as in the file, here is how one would sort the cells represented by columnValue vector based on the strings they are pointing to stored in the sharedStrings vector (in C++ since you said there is no difference) at a cost of 2 extra dereferences per comparison operation:

// sort indexes from columnValue based on comparing values in sharedStrings
sort(columnValue.begin(), columnValue.end(), 
     [&sharedStrings](size_t i1, size_t i2){return sharedStrings[i1] < sharedStrings[i2];});

It wasn't in the OP, but the reverse SharedStringTable lookup operation is slow and caching elements into a dictionary helps.

like image 73
isp-zax Avatar answered Oct 26 '22 05:10

isp-zax


Microsoft Excel Shared Strings Table

Shared strings table is and Open XML standard, as defined by ISO standard - ISO/IEC 29500-1:2016(E)

Official definition of Shared strings (cited from ISO document)

Shared String Table

String values may be stored directly inside spreadsheet cell elements; however, storing the same value inside multiple cell elements can result in very large worksheet Parts, possibly resulting in performance degradation. The Shared String Table is an indexed list of string values, shared across the workbook, which allows implementations to store values only once.

ISO standard on Shared Strings can be downloaded from

https://standards.iso.org/ittf/PubliclyAvailableStandards/c071691_ISO_IEC_29500-1_2016.zip

Answers to the questions on this topic

Question 1: Are shared strings used within the Excel application itself, or only when saving the data?

Answer: Shared strings are used by Excel only at the time of saving the document, I.E., only for the purpose of storing the spreadsheet as a file on storage.

However, when the file is opened for display, the cells are populated with actual string values pulled from the shared strings table.

-

Question 2: What would be an example algorithm to sort on the field then? Any language is fine (c, c#, c++, python).

Answer: For an application like Excel, I guess that a special proprietary variation of Quick sort is the most likely algorithm to be used for sorting on string values.

Excel has a limit of 1,048,576 rows. For this size, Quick sort is definitely a winner. Quick sort can produce very efficient result for data set of this magnitude.

Here is the link to the implementation of Quick Sort in C++ for sorting strings:

http://www.cplusplus.com/forum/beginner/101599/

like image 22
Gopinath Avatar answered Oct 26 '22 05:10

Gopinath