serialized data structures (Was: Generalized, named, and exportable default declarations)

We have seen a lot of effort of better integrating Text into Haskell programming. The only purpose of doing so is to replace String by something more space and time efficient. What would happen if we invest equally much time into making String as efficient as Text? At ICFP 2019 I attended a talk about Gibbon:
https://github.com/iu-parfunc/gibbon
The idea of the project is to serialize (Haskell's) tree data structures in memory as much as possible. Wouldn't this enable us to use String instead of Text, again, maybe even lists instead of Vectors? No more Text integration efforts, no more external library with GHC-specific manual optimizations. Unfortunately, the project is still in an early stage. So far, it only supports strict data structures.
What if we would not complicate the language and generalize syntactic sugar for Text, but instead improve data layout for all Haskell types and eventually make a custom Text type unnecessary?
So essentially Gibbon's aim is to make String work like Text under the hood, without me having to worry about it? +1 for that! Probably I should not comment on this, because I know too little about compilers and serialization. But having to expect something like OverloadedStrings for arbitrary type classes, plus which effect import statements have on this, puts extra cognitive load on the programmer. In contrast, the direction taken by Gibbon removes worries, namely: How much memory penalty do I pay by using this easy-to-understand algebraic data type instead of a more low-level compact representation? I love GHC specifically for its ability to perform many optimizations that Icould not have come up with. Indeed the GHC devs should strive to make the default [*] types work better, as this aligns with the mission of the Haskell Foundation to promote and widen adoption of Haskell. Olaf [*] "default" as dictated by the language report.

On Tue, 13 Apr 2021, Olaf Klinke wrote:
What if we would not complicate the language and generalize syntactic sugar for Text, but instead improve data layout for all Haskell types and eventually make a custom Text type unnecessary?
So essentially Gibbon's aim is to make String work like Text under the hood, without me having to worry about it? +1 for that!
In principle yes. I think the idea is the following: Linked data structures were invented decades ago in order to easily perform in-place updates, e.g. insert and remove elements from lists and trees. However, in Haskell we do not allow in-place modifications and copying a data structure is not much more expensive than traversing it. But if we copy the whole tree anyway why do we still need pointers? We have to acknowledge that linked lists come with some costs. A linked list with elements spread across memory have cache-unfriendly access patterns. In contrast to that a serialized tree is compact in memory and very cache friendly. You can still access a subtree without copying. Yet, linked data structures allow sharing of sub-trees or common list tails. For this case we still need pointers. But maybe not as many as today. Gibbon solves this by using two alternative internal identifiers for every data constructor: One for an embedded subtree and one for a pointer to a subtree. This way a list or a String could be hold in one memory chunk or it could be effectively a linked list of chunks or a linked list of single characters if required.

On Tue, 2021-04-13 at 22:23 +0200, Henning Thielemann wrote:
On Tue, 13 Apr 2021, Olaf Klinke wrote:
What if we would not complicate the language and generalize syntactic sugar for Text, but instead improve data layout for all Haskell types and eventually make a custom Text type unnecessary?
So essentially Gibbon's aim is to make String work like Text under the hood, without me having to worry about it? +1 for that!
In principle yes.
I think the idea is the following: Linked data structures were invented decades ago in order to easily perform in-place updates, e.g. insert and remove elements from lists and trees. However, in Haskell we do not allow in-place modifications and copying a data structure is not much more expensive than traversing it. But if we copy the whole tree anyway why do we still need pointers? We have to acknowledge that linked lists come with some costs. A linked list with elements spread across memory have cache-unfriendly access patterns. In contrast to that a serialized tree is compact in memory and very cache friendly. You can still access a subtree without copying.
Yet, linked data structures allow sharing of sub-trees or common list tails. For this case we still need pointers. But maybe not as many as today. Gibbon solves this by using two alternative internal identifiers for every data constructor: One for an embedded subtree and one for a pointer to a subtree.
This way a list or a String could be hold in one memory chunk or it could be effectively a linked list of chunks or a linked list of single characters if required.
Aha. So when I bind ys = tail xs where xs is compactly represented in memory, then the memory representation of xs is broken up into (head xs : ys) where ys is still compactly represented? I'm thinking procedurally here, which is probably not the way the compliler handles things. How does this relate to "compact regions"? I remember an announcement to this list not so long ago. Olaf
participants (2)
-
Henning Thielemann
-
Olaf Klinke