I have about 100,000 sentences in a List<string>.
I'm trying to split each of these sentences into words and add everything to a List<List<string>>, where the outer list holds one entry per sentence and each inner list holds that sentence's words. I'm doing this because I have to do different work on each individual word. What would be the size difference in memory between the List<string> of sentences and the List<List<string>> of words?
One of these will eventually be stored in memory, so I'm looking for the memory impact of splitting each sentence versus keeping it as a single string.
We'll start with your List<string>. I'm going to assume the 64-bit runtime. Numbers for the 32-bit runtime are slightly smaller.
The List itself requires about 32 bytes (allocation overhead, plus internal variables), plus the backing array of strings. The array overhead is 50 bytes, and you need 8 bytes per string for the references. So if you have 100,000 sentences, you'll need at minimum 800,000 bytes for the array.
The strings themselves require something like 26 bytes each, plus two bytes per character. So if your average sentence is 80 characters, you need 26 + 160 = 186 bytes per string. Multiplied by 100K strings, that's about 18.6 megabytes. Altogether, your list of sentences will take around 20 MB (round number).
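The arithmetic above can be sketched as a small calculation. This is only a back-of-the-envelope estimate: the constants are the approximate 64-bit figures quoted in this answer, not measured values, and the class and method names (`FlatListEstimate`, `Estimate`) are made up for illustration.

```csharp
using System;

// Rough size estimate for a List<string> on the 64-bit runtime.
// All constants are the approximate figures from the answer, not measurements.
public static class FlatListEstimate
{
    const long ListOverhead = 32;    // the List<T> object itself
    const long ArrayOverhead = 50;   // backing array overhead (figure from the text)
    const long ReferenceSize = 8;    // one reference per string
    const long StringOverhead = 26;  // fixed cost per string
    const long BytesPerChar = 2;     // .NET strings store UTF-16 code units

    public static long Estimate(long sentenceCount, long avgChars)
    {
        long backingArray = ArrayOverhead + ReferenceSize * sentenceCount;
        long perString = StringOverhead + BytesPerChar * avgChars;
        return ListOverhead + backingArray + perString * sentenceCount;
    }

    public static void Main()
    {
        // 100,000 sentences averaging 80 characters each.
        long bytes = Estimate(100_000, 80);
        Console.WriteLine($"{bytes} bytes, about {bytes / 1_000_000.0:F1} MB");
    }
}
```

With these inputs the estimate comes to a little over 19.4 million bytes, which is where the "around 20 MB" round number comes from.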
If you split the sentences into words, you now have 100,000 List<string> instances. That's about 5 megabytes just for the List<List<string>>. If we assume 10 words per sentence, then each sentence's list will require about 80 bytes for the backing array (10 references at 8 bytes each), plus 26 bytes of overhead per string (about 260 bytes in total), plus the string data itself (8 characters per word at 2 bytes each, or 160 bytes total). So each sentence costs you (again, round numbers) 80 + 260 + 160, or 500 bytes. Multiplied by 100,000 sentences, that's 50 MB.
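The per-sentence figure can be reproduced the same way. Again, this is a sketch under the answer's assumptions (10 words per sentence, 8 characters per word, 64-bit references); `SplitListEstimate` and `PerSentence` are hypothetical names, and the result deliberately leaves out the List<string> objects themselves, which the answer counts separately as roughly 5 MB.

```csharp
using System;

// Rough per-sentence cost of storing the words as separate strings.
// Constants are the approximate 64-bit figures from the answer.
public static class SplitListEstimate
{
    const long ReferenceSize = 8;    // one reference per word in the backing array
    const long StringOverhead = 26;  // fixed cost per string
    const long BytesPerChar = 2;     // UTF-16 code unit

    // Excludes the List<string> object itself and the array's own overhead.
    public static long PerSentence(long words, long charsPerWord)
    {
        long backingArray = ReferenceSize * words;
        long stringOverhead = StringOverhead * words;
        long characterData = BytesPerChar * charsPerWord * words;
        return backingArray + stringOverhead + characterData;
    }

    public static void Main()
    {
        long perSentence = PerSentence(10, 8);   // 80 + 260 + 160
        long total = perSentence * 100_000;
        Console.WriteLine(perSentence);          // prints 500
        Console.WriteLine(total);                // prints 50000000
    }
}
```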
So, in very rough numbers, splitting your sentences into a List<List<string>> will occupy 55 to 60 megabytes, roughly three times the 20 MB or so needed for the plain list of sentences.
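If you would rather check these estimates against your own data, one rough approach is to compare the managed heap size before and after building each structure. The sketch below uses synthetic sentences (an assumption; substitute your real list), and the helper names `BuildSentences` and `SplitAll` are invented for the example. `GC.GetTotalMemory(true)` forces a collection before reporting, but the numbers are still approximate and will vary by runtime version.

```csharp
using System;
using System.Collections.Generic;

public static class MeasureSplitCost
{
    // Builds distinct synthetic sentences. Appending the index forces a new
    // string object per entry instead of reusing one interned literal.
    public static List<string> BuildSentences(int count)
    {
        var list = new List<string>(count);
        for (int i = 0; i < count; i++)
            list.Add("alpha beta gamma delta epsilon zeta eta theta iota kappa " + i);
        return list;
    }

    // Splits every sentence on spaces into its own List<string>.
    public static List<List<string>> SplitAll(List<string> sentences)
    {
        var result = new List<List<string>>(sentences.Count);
        foreach (string s in sentences)
            result.Add(new List<string>(s.Split(' ')));
        return result;
    }

    public static void Main()
    {
        long start = GC.GetTotalMemory(true);
        List<string> sentences = BuildSentences(100_000);
        long afterFlat = GC.GetTotalMemory(true);
        List<List<string>> words = SplitAll(sentences);
        long afterSplit = GC.GetTotalMemory(true);

        Console.WriteLine($"List<string>:       {afterFlat - start} bytes");
        Console.WriteLine($"List<List<string>>: {afterSplit - afterFlat} bytes");

        // Keep both structures reachable until the measurements are done.
        GC.KeepAlive(sentences);
        GC.KeepAlive(words);
    }
}
```

The second figure is the additional cost of the split structure while the original sentences are still alive, which is the comparison the question asks about.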