caml-list - the Caml user's mailing list
 help / Atom feed
From: Ivan Gotovchits <ivg@ieee.org>
To: Oleg <oleg@okmij.org>, Ivan Gotovchits <ivg@ieee.org>, caml-list <caml-list@inria.fr>
Subject: Re: [Caml-list] Type-indexed heterogeneous collections (Was: Implicits for the masses)
Date: Tue, 10 Sep 2019 15:03:15 -0400
Message-ID: <CALdWJ+zsz99-rSa3S9JMmGUSWM6_PHTg7RF4+COOGsKDTXsZKw@mail.gmail.com> (raw)
In-Reply-To: <20190910144016.GA1725@Melchior.localnet>

[-- Attachment #1: Type: text/plain, Size: 9327 bytes --]

On Tue, Sep 10, 2019 at 10:38 AM Oleg <oleg@okmij.org> wrote:

>
> > https://github.com/BinaryAnalysisPlatform/bap - the BAP project per se.
> > ...
> That is quite a project! And interesting, too. Thank you for letting
> me know. Now I understand what you mean by ``universe of types
> actually grew quite large, that large that the linear search in the
> registry is not an option for us anymore.''
>
> What your message has made it very clear, to me, is that we really
> need the standard type representation library -- hopefully being part
> of Stdlib and eventually integrated with the compiler. For one, it is
> high time we standardized the spelling of the ('a,'b) eq type and its
> constructor. Mainly, as your message has well demonstrated, an
> efficient trep is actually non-trivial, and takes lots of experience
> to design. Incidentally, the situation on Haskell is very similar:
> Typeable is now well integrated with the compiler. It is quite a
> popular library.
>

Definitely! So far the closest to what could be called a standard
representation is the Base.Type_equal [1] module (and its extensions [2]).
But unfortunately, it comes with strings attached and that hampers its
adoption in other projects.
Basically, we need at least to put the data constructor in a separate
library (as it was done for the `result` type previously). Going beyond
just defining the data type would require some consensus. A good example
would be deciding how to implement the total order relation on type
indices. The Base Library is using `Obj.extension_constructor` which has a
few problems:
1) It is in the Obj module with all the ramifications
2) The implementation is inefficient and overly cautious (so cautious that
it doesn't work with the Ancient module, for example)
3) The corresponding index is too global, not only shared by all extensible
variants and exceptions but every new object increments the identifier
(i.e., it is the same global counter which is used by `Oo.id`,
which is probably an error which should be fixed in the compiler),
therefore there is a chance to overflow it.

That's why we decided to implement our own indexing, for example.

[1]: https://github.com/janestreet/base/blob/master/src/type_equal.mli
[2]: https://github.com/janestreet/typerep


>
> I'd also like to draw attention to truly type-indexed heterogeneous
> collections where trep is a part of a key rather than a part of a
> value.


I like to treat such structures as a reification of OCaml records so that
trep is now a first-class field and a record is an n-tuple with named
fields. Since the fields are named, the n-tuple can grow and shrink, i.e.,
the size is not fixed. We can implement the n-tuple using heterogeneous
maps, hashtables, or just an assoc list. So far we had a few libraries
which provided heterogeneous maps indexed by the type index, the `Univ_map`
[3] from Janestreet and Hmap [4] from Daniel Bünzli, to name a few.

Both libraries follow the same approach, namely parametrizing the existing
non-heterogeneous Map functor with the order function induced by the age of
a key, with some minor differences, e.g., Base's version is using the int
type as the key (where the key denotes the extension constructor
identifier), while Daniel is using the trep itself as the key, and
extracting the order from the age of the key.
However, though semantically data is indeed indexed by the trpep,
syntactically it is represented with an extra layer of indirection, e.g.,
instead of having a single binding

     type 'key binding = Binding : 'a trep * 'a -> 'key binding

we have a non-heterogeneous binding

    type ('a,'b) binding = Binding of 'a * 'b

which packs a universal something like this as `Binding (trep,Pack
(trep,x))`.

This creates a lot of overhead memory-wise and (given that OCaml
performance is usually conditioned by the allocation rate)
performance-wise. OCaml AVL-trees are already having a lot of overhead,
about 5 extra words per bindings (Base has a little bit less, but still a
lot - about 3 words per binding). Therefore, (after lots of profiling) we
actually ended up implementing heterogenous AVL trees, which doesn't suffer
from this problem. We ended up with the following representation [5]
```
  type record =
    | T0
    | T1 : 'a key * 'a -> record
    | T2 : 'a key * 'a *
           'b key * 'b -> record
    | T3 : 'a key * 'a *
           'b key * 'b *
           'c key * 'c -> record
    | T4 : 'a key * 'a *
           'b key * 'b *
           'c key * 'c *
           'd key * 'd -> record
    | LL : record * 'a key * 'a * record -> record (* h(x) = h(y) - 1 *)
    | EQ : record * 'a key * 'a * record -> record (* h(x) = h(y) *)
    | LR : record * 'a key * 'a * record -> record (* h(x) = h(y) + 1 *)
```

Which stores the key and data directly on the branches of a tree without
any unnecessary indirection. And also unfolds n-tuples with n < 5, an
implementation detail which is not really important wrt to the current
discussion, (but was a dealbreaker for us as it significantly reduced the
memory consumption and the total number of rebalances, so that at the end
of the day we got a factor of 5 improvements both in time and space).

 Another interesting feature is that if implemented as a functor, this
implementation could be used to derive non-heterogeneous maps (e.g., when
`'a key = int`).

During the implementation, we had to find solutions to a few very
interesting problems, like how to represent an existential folding function
(or how to fold over heterogeneous map)[6]. Or more interesting, how to
update keys and merge two maps [7] and do this efficiently (hint: you have
to use CPS).


[3]: https://github.com/janestreet/core_kernel/blob/master/src/univ_map.ml
[4]: https://github.com/dbuenzli/hmap/blob/master/src/hmap.ml
[5]:
https://github.com/BinaryAnalysisPlatform/bap/blob/0352edfe42ba5560b178b6bf990cf15e81d78f57/lib/knowledge/bap_knowledge.ml#L766
[6]:
https://github.com/BinaryAnalysisPlatform/bap/blob/0352edfe42ba5560b178b6bf990cf15e81d78f57/lib/knowledge/bap_knowledge.ml#L824
[7]:
https://github.com/BinaryAnalysisPlatform/bap/blob/0352edfe42ba5560b178b6bf990cf15e81d78f57/lib/knowledge/bap_knowledge.ml#L1316

Treating heterogenous maps as a generalization of records brings us to
another interesting development. We do not want to completely forfeit types
of records and treat all records as a member of one universal record type.
We would like to be able to define type schemes so that we can distinguish
between a `student record` and a `course record` while keeping both
universal and extensible. In BAP we implemented this using classes and
slots (borrowing those ideas from frame languages of AI, any resemblance to
OO is purely coincidental). We turn our untyped `record` into a value
parameterized by its class, e.g.,

```
type 'a value = {
     cls : 'a cls;
     data : record;
}
```

The `'a cls` type could be phantom, e.g., `type 'a cls = 'a`, but we can
benefit from storing some type information to implement runtime reflection.
And an extra layer of indirection on top of treps enables the structural
definition of a record type that can span across multiple modules:
```
type ('a,'p) slot = {
   cls : 'a cls;
   key : 'p key;
}
```

so that now a class is defined by its slots, e.g.,

```
val name : (student, string) slot
val age : (student, age) slot
...
val teacher : (course, teacher) slot
```

and an abstract interface of the Value will prevent us from confusing
different fields:
```
module Value : sig
  type 'a t
  val get : ('a,'p) slot -> 'a t -> 'p
  val put : ('a,'p) slot -> 'a t -> p -> 'a t
end
```

The real implementation in BAP [8], is a little bit more convoluted since
we want to create hierarchies of classes (which we need for type-safe
representation of their denotational semantics) and make classes indexable
by runtime values. Therefore our class is actually a collection of sorts.
And we also have triggers (i.e., computable fields), as well as we use
domains (aka Scott-Ershov domains) to represent the properties of classes,
which gives us a nice feature - our `Value.get` function is total as for
each type of property we have the bottom value as the default.

Finally, we can store our values in a database and hide this storage under
a monad interface, so that we can implement different access schemes (and
put a fixed point computation inside), but this is a completely different
story. The takeaway is that OCaml as a language is now fully equipped for
implementing efficient heterogeneous data structures and can be used in
domains which previously were previously dominated by languages with
dynamic typing.

Besides, the Knowledge library is totally independent on BAP despite the
fact that it is currently distributed from the BAP repository. We've tried
not to leak any domain-specific details into the knowledge abstraction
albeit keeping it versatile enough to be able to represent the knowledge
representation of the program semantics.

[8]:
https://github.com/BinaryAnalysisPlatform/bap/blob/master/lib/knowledge/bap_knowledge.mli#L153

Cheers,
Ivan Gotovchits

[-- Attachment #2: Type: text/html, Size: 12168 bytes --]

      reply index

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-04 15:23 [Caml-list] Implicits for the masses Oleg
2019-09-04 20:40 ` Ivan Gotovchits
2019-09-10 14:38   ` [Caml-list] Type-indexed heterogeneous collections (Was: Implicits for the masses) Oleg
2019-09-10 19:01     ` Ivan Gotovchits [this message]

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CALdWJ+zsz99-rSa3S9JMmGUSWM6_PHTg7RF4+COOGsKDTXsZKw@mail.gmail.com \
    --to=ivg@ieee.org \
    --cc=caml-list@inria.fr \
    --cc=oleg@okmij.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

caml-list - the Caml user's mailing list

Archives are clonable: git clone --mirror https://inbox.ocaml.org/caml-list

AGPL code for this site: git clone https://public-inbox.org/ public-inbox