I have a Dictionary<string, int[]> which I need to store and retrieve as efficiently as possible from the disk.
The key (a Unicode string) will typically be 1 to 60 characters long. Longer keys are possible but rare, and those entries could be discarded. Integers in the array will be in the range 1 to 100 million (typically 1 to 5 million).
My first idea was to use a delimited format:
key [tab] int,int,int,int,...
key2 [tab] int,int,int,int,...
...
and to load the dictionary as follows:
// dic is the target Dictionary<string, int[]>, declared elsewhere.
string[] lines = File.ReadAllLines(sIndexName);
// Skip the header line of the index file.
for (int i = 1; i < lines.Length; i++)
{
    string[] keyValues = lines[i].Split('\t');
    // Split(',') returns a single element when there is no comma,
    // so the single-value case needs no special handling.
    int[] iInts = keyValues[1].Split(',').Select(x => int.Parse(x)).ToArray();
    Array.Sort(iInts);
    dic.Add(keyValues[0], iInts);
}
It works, but going over the potential size requirements, it's obvious this method is never going to scale well enough.
Is there an off-the-shelf solution for this problem or do I need to rework the algorithm completely?
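If the text format has to stay, the loader above can at least avoid holding the whole file in memory. The following is only a sketch under assumptions: `IndexLoader` and `LoadIndex` are hypothetical names, and skipping lines without a tab is a choice made here, not something the original code does.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class IndexLoader // hypothetical helper, not from the original post
{
    public static Dictionary<string, int[]> LoadIndex(string path)
    {
        var dic = new Dictionary<string, int[]>();
        // File.ReadLines enumerates the file lazily, one line at a time;
        // Skip(1) drops the header line.
        foreach (string line in File.ReadLines(path).Skip(1))
        {
            int tab = line.IndexOf('\t');
            if (tab < 0) continue; // assumption: silently ignore malformed lines
            string key = line.Substring(0, tab);
            int[] values = Array.ConvertAll(
                line.Substring(tab + 1).Split(','), s => int.Parse(s));
            Array.Sort(values);
            dic.Add(key, values); // throws on duplicate keys, like the original
        }
        return dic;
    }
}
```

This keeps only one line plus the finished dictionary in memory, but parsing every integer from text remains the dominant cost, so it does not change how the approach scales.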
Edit: I am a little embarrassed to admit it, but I didn't know dictionaries could just be serialized to binary. I gave it a test run and it's pretty much what I needed.
Here is the code (suggestions welcome):
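public static void saveToFile(Dictionary<string, List<int>> dic)
{
    // FileMode.Create truncates an existing file; OpenOrCreate would leave
    // stale bytes behind when the new data is shorter than the old file.
    using (FileStream fs = new FileStream(_PATH_TO_BIN, FileMode.Create))
    {
        BinaryFormatter bf = new BinaryFormatter();
        bf.Serialize(fs, dic);
    }
}
public static Dictionary<string, List<int>> loadBinFile()
{
    try
    {
        // using disposes the stream even when deserialization throws.
        using (FileStream fs = new FileStream(_PATH_TO_BIN, FileMode.Open))
        {
            BinaryFormatter bf = new BinaryFormatter();
            return (Dictionary<string, List<int>>)bf.Deserialize(fs);
        }
    }
    catch
    {
        // Returns null on any failure (missing file, corrupt data).
        return null;
    }
}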
With a dictionary of 100k entries, each holding an array of 4k integers, serialization takes 14 seconds, deserialization takes 10 seconds, and the resulting file is 1.6 GB.
@Patryk: Please convert your comment to an answer so I can mark it as approved.
The Dictionary<TKey, TValue> is marked as [Serializable] (and implements ISerializable), which can be seen here.
That means you can use e.g. BinaryFormatter to perform binary serialization and deserialization to and from a stream, say, a FileStream. :)
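BinaryFormatter is marked obsolete in recent .NET versions, so a hand-rolled layout with BinaryWriter and BinaryReader may be worth considering as an alternative. The following is a minimal sketch, not the method from the question: `IndexFile`, `Save`, `Load`, and the file layout itself are assumptions made for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class IndexFile // hypothetical name
{
    // Layout (an assumption of this sketch): entry count, then for each entry
    // a length-prefixed UTF-8 key, the array length, and the raw 32-bit values.
    public static void Save(string path, Dictionary<string, int[]> dic)
    {
        using (var fs = new FileStream(path, FileMode.Create))
        using (var bw = new BinaryWriter(fs, Encoding.UTF8))
        {
            bw.Write(dic.Count);
            foreach (KeyValuePair<string, int[]> entry in dic)
            {
                bw.Write(entry.Key);          // length-prefixed string
                bw.Write(entry.Value.Length);
                foreach (int v in entry.Value) bw.Write(v);
            }
        }
    }

    public static Dictionary<string, int[]> Load(string path)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        using (var br = new BinaryReader(fs, Encoding.UTF8))
        {
            int count = br.ReadInt32();
            var dic = new Dictionary<string, int[]>(count);
            for (int i = 0; i < count; i++)
            {
                string key = br.ReadString();
                int[] values = new int[br.ReadInt32()];
                for (int j = 0; j < values.Length; j++) values[j] = br.ReadInt32();
                dic.Add(key, values);
            }
            return dic;
        }
    }
}
```

Each value still takes 4 bytes, so 100k entries of 4k integers come to roughly 1.6 GB of payload either way; any gain from this layout would be in speed and in not depending on BinaryFormatter, not in file size.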