Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

References to same data and memory allocation

Please consider the following data model:

data Artist = Artist Text
data Song = Song Artist Text
data Catalogue = Catalogue (Set Artist) (Set Song)

You can see that the Artists are referred to from both the Songs and the Catalogue. The Catalogue contains a list of all artists referred to from Songs, so the same values of Artist get referred to from two places.

Suppose we were to generate the Catalogue value using multiple applications of the following function:

insertSong :: Song -> Catalogue -> Catalogue
insertSong song@(Song artist title) (Catalogue artists songs) =
  Catalogue (Set.insert artist artists) (Set.insert song songs)

It's evident that the Catalogue would get filled by references to the same values of Artist as the Songs refer to, thus saving the memory by not storing the copies of those values.

The problem is that when I try to recreate the catalogue from serialized data by separately deserializing a set of artists and a set of songs, the application occupies way more memory than when it generated the same value of Catalogue with insertSong. I suspect that it is caused by the lost relation between same Artists referred to from Songs and the Catalogue, which is why I get copies of values of Artist occupying extra memory.

The only solution I see is to first deserialize the set of artists and then to deserialize the set of songs while forcefully replacing the values of Artist with the ones from the first set.

So my questions are:

  1. Am I right in my suspicion?
  2. Will the solution I see work?
  3. Are there any better ways to solve this?
like image 202
Nikita Volkov Avatar asked Jul 14 '13 21:07

Nikita Volkov


2 Answers

  1. That sounds plausible.
  2. It should work if done right. In particular, you have to ensure that everything is eagerly evaluated, to avoid referencing the old Text values from thunks.
  3. You can choose a smarter serialization format. For example, when you serialize songs, store the index of the artist in the artists list instead of the full artist name. Then look it up during deserialization.

Note that sharing will be also lost if you do any kind of computation on your strings (i.e. even if artist1 and artist2 are the same and shared, f artist1 and f artist2 are probably not). If this becomes a problem, you can do similar changes to your data structures, too.

like image 150
Roman Cheplyaka Avatar answered Sep 29 '22 10:09

Roman Cheplyaka


A simple solution seems to cache data using a somewhat degenerated map:

{-# LANGUAGE DeriveDataTypeable, RankNTypes #-}
import Control.Monad
import Control.Monad.State
import Data.Map (Map)
import qualified Data.Map as M

type Cache a = Map a a

We can then query this cache if there is already an entry equal to this one, and replace it with the cached one:

cached :: (Ord a) => a -> State (Cache a) a
cached x = state $ \m ->
    case M.lookup x m of
        Just x'     -> (x', m)
        Nothing     -> (x, M.insert x x m)

This way, if we load several equal elements of type a, we transform them into a single. This can be done during deserialization or once at the end.


Perhaps it would be possible to generalize it further and use SYB to map all values of some given type in a data structure through a cache:

import Data.Data (Data)
import Data.Generics.Aliases (mkM)
import Data.Generics.Schemes (everywhereM)
import Data.Typeable (Typeable)

replaceFromCache
    :: (Ord a, Typeable a, Data b)
    => b -> State (Cache a) b
replaceFromCache = everywhereM (mkM cached)

Then we can replace all artists in some data structure like

data Artist = Artist String
  deriving (Eq, Ord, Typeable)

cacheAllArtists :: (Data b) => b -> b
cacheAllArtists b = evalState (replaceFromCache b) (M.empty :: Cache Artist)

Or we can use Proxy phantom type to create a general version:

cacheAll :: (Ord a, Typeable a, Data b)
      => Proxy a -> b -> b
cacheAll p = flip evalState (emptyOf p) . replaceFromCache
  where
    emptyOf p = asTypeOf2 M.empty p
    asTypeOf2 :: f a -> Proxy a -> f a
    asTypeOf2 = const

cacheAllArtists :: (Data b) => b -> b
cacheAllArtists = cacheAll (Proxy :: Proxy Artist)

(Disclaimer: I haven't tested any of the above code.)

like image 23
Petr Avatar answered Sep 29 '22 11:09

Petr