Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can we write a function in Haskell to measure the number of bytes that a string of n characters occupies?

Suppose I have the string

"abc"

and I want to calculate the number of bytes it occupies in memory.

I could do:

import Data.Bits
finiteBitSize ("abc" :: [Char])

but that breaks because [Char] is not a type supported by the function. (Also it is bits not bytes, but the point was to paint a picture of what I'm looking for).

My question is: Can we write a function in Haskell to measure the number of bytes that a string of n characters occupies?

like image 267
hawkeye Avatar asked Dec 01 '22 09:12

hawkeye


2 Answers

It's complicated.

Let's talk about GHC, and about String specifically, and let's assume the thing has been fully evaluated, so we didn't get to use it iteratively in a GC-friendly way and we didn't get to delay evaluation and store a tiny thunk representing an enormous data structure.

After making all those simplifying assumptions, we're gonna need to know some definitions.

type String = [Char]
data [a] = [] | a : [a] -- pseudosyntax
data Char = C# Char# -- I'm guessing, couldn't find a canonical source

Now we're going to use a few rules of thumb. First: unboxed stuff (like Char#) is generally stored in a machine word. We live in a world of 64-bit machines, so Char# is probably 8 bytes, even though it likely only uses the bottom 4 bytes of it. Second: data constructors are a word to say which constructor, plus a word to point at each of the fields.

Now we're ready to tot it up.

Empty strings are [], one word for the constructor, no words for fields, so one word total.

Nonempty strings are c : cs, so one word for the : constructor, one word to point to c, one word to point to cs, one word for the C# constructor, one word for the Char#. That's 5 words plus however many we need for cs.

So for a String of length n, we've got 5*n words to represent the body of the String and one extra for the terminating []. Practically speaking that's 40 bytes per character! Yikes.

And now you know why packed representations like Text (or, when appropriate, ByteString) are such a big deal.

like image 190
Daniel Wagner Avatar answered Apr 28 '23 03:04

Daniel Wagner


There's a Haskell package called ghc-datasize which allows you to calculate real occupied memory by every Haskell value. You can even see an example of how to calculate string size in the official Hackage documentation:

  • http://hackage.haskell.org/package/ghc-datasize

To calculate the length of the String you need to use the following function:

recursiveSize :: a -> IO Word

It looks like this:

ghci> recursiveSize $!! "foobar"
240
like image 22
Shersh Avatar answered Apr 28 '23 04:04

Shersh