Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Serialization in Haskell

From the bird's view, my question is: Is there a universal mechanism for as-is data serialization in Haskell?

Introduction

The origin of the problem does not root in Haskell indeed. Once, I tried to serialize a python dictionary where a hash function of objects was quite heavy. I found that in python, the default dictionary serialization does not save the internal structure of the dictionary but just dumps a list of key-value pairs. As a result, the de-serialization process is time-consuming, and there is no way to struggle with it. I was certain that there is a way in Haskell because, at my glance, there should be no problem transferring a pure Haskell type to a byte-stream automatically using BFS or DFS. Surprisingly, but it does not. This problem was discussed here (citation below)

Currently, there is no way to make HashMap serializable without modifying the HashMap library itself. It is not possible to make Data.HashMap an instance of Generic (for use with cereal) using stand-alone deriving as described by @mergeconflict's answer, because Data.HashMap does not export all its constructors (this is a requirement for GHC). So, the only solution left to serialize the HashMap seems to be to use the toList/fromList interface.

Current Problem

I have quite the same problem with Data.Trie bytestring-trie package. Building a trie for my data is heavily time-consuming and I need a mechanism to serialize and de-serialize this tire. However, it looks like the previous case, I see no way how to make Data.Trie an instance of Generic (or, am I wrong)?

So the questions are:

  1. Is there some kind of a universal mechanism to project a pure Haskell type to a byte string? If no, is it a fundamental restriction or just a lack of implementations?

  2. If no, what is the most painless way to modify the bytestring-trie package to make it the instance of Generic and serialize with Data.Store

like image 979
Sergey Sosnin Avatar asked Jul 01 '21 18:07

Sergey Sosnin


1 Answers

  1. There is a way using compact regions, but there is a big restriction:

Our binary representation contains direct pointers to the info tables of objects in the region. This means that the info tables of the receiving process must be laid out in exactly the same way as from the original process; in practice, this means using static linking, using the exact same binary and turning off ASLR. This API does NOT do any safety checking and will probably segfault if you get it wrong. DO NOT run this on untrusted input.

This also gives insight into universal serialization is not possible currently. Data structures contain very specific pointers which can differ if you're using different binaries. Reading in the raw bytes into another binary will result in invalid pointers.

There is some discussion in this GitHub issue about weakening this requirement.

  1. I think the proper way is to open an issue or pull request upstream to export the data constructors in the internal module. That is what happened with HashMap which is now fully accessible in its internal module.

Update: it seems there is already a similar open issue about this.

like image 196
Noughtmare Avatar answered Sep 21 '22 09:09

Noughtmare