Please consider the following data model:
data Artist = Artist Text
data Song = Song Artist Text
data Catalogue = Catalogue (Set Artist) (Set Song)
You can see that the Artist values are referred to from both the Songs and the Catalogue. The Catalogue contains a set of all artists referred to from the Songs, so the same Artist values are referenced from two places.
Suppose we were to generate the Catalogue value using multiple applications of the following function:
insertSong :: Song -> Catalogue -> Catalogue
insertSong song@(Song artist title) (Catalogue artists songs) =
  Catalogue (Set.insert artist artists) (Set.insert song songs)
It's evident that the Catalogue would then be filled with references to the very same Artist values that the Songs refer to, thus saving memory by not storing copies of those values.
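For instance, the whole catalogue could be built with a simple fold (a hedged sketch; buildCatalogue is just an illustrative name, and it assumes the Ord instances that Set already requires):

-- Illustrative sketch: fold insertSong over all songs, starting from an
-- empty catalogue. Every Artist stored in the Catalogue is then the very
-- same heap object as the one inside its Song.
buildCatalogue :: [Song] -> Catalogue
buildCatalogue = foldr insertSong (Catalogue Set.empty Set.empty)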
The problem is that when I recreate the catalogue from serialized data by deserializing the set of artists and the set of songs separately, the application occupies far more memory than when it built the same Catalogue value with insertSong. I suspect this is caused by the lost sharing between the identical Artist values referred to from the Songs and from the Catalogue, which is why I end up with copies of Artist values occupying extra memory.
The only solution I see is to first deserialize the set of artists and then to deserialize the set of songs while forcefully replacing the values of Artist with the ones from the first set.
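For illustration, a hedged sketch of what that manual replacement might look like (rebuildCatalogue and canonicalise are made-up names; it uses Data.Set and Data.Map imported qualified as Set and Map, and assumes the Ord instances that Set already requires):

-- Illustrative sketch: after deserializing both sets separately, rewrite
-- every Song so its Artist is the copy stored in the artists set.
rebuildCatalogue :: Set Artist -> Set Song -> Catalogue
rebuildCatalogue artists songs = Catalogue artists (Set.map canonicalise songs)
  where
    canonical = Map.fromList [(a, a) | a <- Set.toList artists]
    canonicalise (Song artist title) =
      Song (Map.findWithDefault artist artist canonical) title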
So my questions are: am I right about the cause, and is there a better way to restore the sharing than manually replacing the Artist values during deserialization?
Note that sharing will also be lost if you do any kind of computation on your strings (e.g. even if artist1 and artist2 are the same and shared, f artist1 and f artist2 probably are not). If this becomes a problem, you can apply similar changes to your data structures, too.
A simple solution seems to be to cache data using a somewhat degenerate map:
{-# LANGUAGE DeriveDataTypeable, RankNTypes #-}
import Control.Monad
import Control.Monad.State
import Data.Map (Map)
import qualified Data.Map as M
type Cache a = Map a a
We can then query this cache to check whether there is already an entry equal to the given one, and if so replace the value with the cached copy:
cached :: (Ord a) => a -> State (Cache a) a
cached x = state $ \m ->
  case M.lookup x m of
    Just x' -> (x', m)
    Nothing -> (x, M.insert x x m)
This way, if we load several equal elements of type a, we collapse them into a single shared value. This can be done during deserialization or in one pass at the end.
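For example, a minimal sketch of running a freshly deserialized list of artists through the cache (dedupArtists is a made-up name, and it assumes Artist has an Ord instance):

-- Illustrative usage: equal Artist values collapse to one shared copy.
dedupArtists :: [Artist] -> [Artist]
dedupArtists xs = evalState (mapM cached xs) (M.empty :: Cache Artist)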
Perhaps it would be possible to generalize it further and use SYB to map all values of some given type in a data structure through a cache:
import Data.Data (Data)
import Data.Generics.Aliases (mkM)
import Data.Generics.Schemes (everywhereM)
import Data.Typeable (Typeable)
replaceFromCache
  :: (Ord a, Typeable a, Data b)
  => b -> State (Cache a) b
replaceFromCache = everywhereM (mkM cached)
Then we can replace all artists in some data structure like this:
data Artist = Artist String
  deriving (Eq, Ord, Typeable, Data)
cacheAllArtists :: (Data b) => b -> b
cacheAllArtists b = evalState (replaceFromCache b) (M.empty :: Cache Artist)
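Applied to the question's Catalogue, a hedged sketch (it assumes Catalogue, Song and Artist, whatever their exact field types, all have Data instances, e.g. via DeriveDataTypeable, since everywhereM must be able to traverse down to every Artist):

-- Illustrative usage: one bottom-up traversal collapses equal Artist
-- values in both the artists set and the songs set.
restoreSharing :: Catalogue -> Catalogue
restoreSharing = cacheAllArtists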
Or we can use a Proxy phantom type to create a generic version:
import Data.Proxy (Proxy (..))

cacheAll :: (Ord a, Typeable a, Data b)
         => Proxy a -> b -> b
cacheAll p = flip evalState (emptyOf p) . replaceFromCache
  where
    emptyOf p = asTypeOf2 M.empty p
    asTypeOf2 :: f a -> Proxy a -> f a
    asTypeOf2 = const
cacheAllArtists :: (Data b) => b -> b
cacheAllArtists = cacheAll (Proxy :: Proxy Artist)
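And a hedged end-to-end sketch of how this could slot into loading (decodeCatalogue is only a placeholder for whatever deserializer is actually in use, and Data instances for the catalogue types are assumed as above):

import Data.ByteString.Lazy (ByteString)

-- Illustrative pipeline: decode first, then restore the sharing that the
-- decoder could not preserve.
loadCatalogue :: ByteString -> Catalogue
loadCatalogue = cacheAll (Proxy :: Proxy Artist) . decodeCatalogue
  where
    decodeCatalogue :: ByteString -> Catalogue
    decodeCatalogue = undefined  -- placeholder, not a real implementation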
(Disclaimer: I haven't tested any of the above code.)