In Excel, they 'compress' strings to a numerical mapping (though I'm not sure that the word compress is correct in this case). Here is an example shown below:
While this helps to reduce the overall filesize and memory footprint, how then does Excel do sorting on a string field? Would every single string need to go through the lookup mapping: and if so, wouldn't that greatly increase the cost of / slowing down doing a sort on a string field (what if there were 1M values, 1M key lookups wouldn't be trivial). Two questions on this:
I can't find how exactly Excel stores cells with SharedStringTable
elements in-memory at runtime, but storing them as an index of the item in SharedStringTable
requires just one extra dereference to access them, assuming that the elements are stored as an array. So my guess is that this is how it is done. That is the simplest way and the only way to make it faster is having runtime representation of SharedStringTable
already sorted by elements. In such case sorting by an index is equivalent to sorting by the value. That approach, however, makes the insertion operation costly as when a new string is inserted into the middle of the table all the indexes larger than it should be incremented and the number of such cells in the document can be very large, up to all the cells referring to SharedStringTable
.
If the cells contain indexes same as in the file, here is how one would sort the cells represented by columnValue
vector based on the strings they are pointing to stored in the sharedStrings
vector (in C++ since you said there is no difference) at a cost of 2 extra dereferences per comparison operation:
// sort indexes from columnValue based on comparing values in sharedStrings
sort(columnValue.begin(), columnValue.end(),
[&sharedStrings](size_t i1, size_t i2){return sharedStrings[i1] < sharedStrings[i2];});
It wasn't in the OP, but the reverse SharedStringTable
lookup operation is slow and caching elements into a dictionary helps.
Microsoft Excel Shared Strings Table
Shared strings table is and Open XML standard, as defined by ISO standard - ISO/IEC 29500-1:2016(E)
Official definition of Shared strings (cited from ISO document)
Shared String Table
String values may be stored directly inside spreadsheet cell elements; however, storing the same value inside multiple cell elements can result in very large worksheet Parts, possibly resulting in performance degradation. The Shared String Table is an indexed list of string values, shared across the workbook, which allows implementations to store values only once.
ISO standard on Shared Strings can be downloaded from
https://standards.iso.org/ittf/PubliclyAvailableStandards/c071691_ISO_IEC_29500-1_2016.zip
Answers to the questions on this topic
Question 1: Are shared strings used within the Excel application itself, or only when saving the data?
Answer: Shared strings are used by Excel only at the time of saving the document, I.E., only for the purpose of storing the spreadsheet as a file on storage.
However, when the file is opened for display, the cells are populated with actual string values pulled from the shared strings table.
-
Question 2: What would be an example algorithm to sort on the field then? Any language is fine (c, c#, c++, python).
Answer: For an application like Excel, I guess that a special proprietary variation of Quick sort is the most likely algorithm to be used for sorting on string values.
Excel has a limit of 1,048,576 rows. For this size, Quick sort is definitely a winner. Quick sort can produce very efficient result for data set of this magnitude.
Here is the link to the implementation of Quick Sort in C++ for sorting strings:
http://www.cplusplus.com/forum/beginner/101599/
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With