caml-list - the Caml user's mailing list
 help / Atom feed
From: Niols <niols@niols.fr>
To: caml-list@inria.fr
Subject: Re: [Caml-list] O(n ln k) sorting for ocaml on github and a challenge
Date: Mon, 11 Oct 2021 13:46:45 +0200
Message-ID: <875a3212-c4a1-41fb-65f7-2a68b012af9f@niols.fr> (raw)
In-Reply-To: <20211009195910.mqivg4tsafiyb3tr@oulala>

Hi Mário, Christophe,

I'm glad some other people are also doing that kind of things!

On 10/9/21 9:59 PM, Christophe Raffalli wrote:
> On 21-10-09 09:58:12, Mario Pereira wrote:
>> Your implementation reminds me very much of TimSort (for an OCaml
>> implementation, see for instance https://github.com/LesBoloss-es/sorting/blob/
>> master/src/list/timsort.ml). This is also a stable algorithm.
> 
> Yes the idea is similar, but
> 
> - I use a reference to the list for the sorted/reverse-sorted block I build in
>    first phase (like Timsort on arrays, which actually is stable, in place and as
>    good complexity, but you have to get the balancing right, see below)
> 
> - in the second phase, I use a merge, returning sorted list at even depth and
> rev-sorted at odd depth (like OCaml's List.sort). This avoid the Asc/Desc
> constructor and the List.rev.
> 
> - the balancement is ensured by a simple arithmetic computation, not by a stack
> and an invariant on sizes (probably building a balanced binary tree with the
> initial block instead of a list could be a bit better, I am not sure).
> 
> - Last but not least: the code you point seems broken somewhere, it does not
>    ends on large lists (probably quadratic because the balancing with size is
>    probably wrong. cf the paragraph "bug" on wikipedia about Timsort. Anyway, the
>    code you point is probably optimised for arrays.

The repository in question is very experimental. Our implementation of 
TimSort on list is indeed broken. I can't remember if it contains the 
famous TimSort bug, but we are aware that it exists and know of two ways 
to fix it. We quickly switched to working on arrays, though, and were 
planning on coming back to list eventually.

Our implementation of TimSort on arrays can be found here:

 
https://github.com/LesBoloss-es/sorting/blob/master/src/array/timsort.ml

It does include a fix of the bug. We have tested it successfully on a 
rather wide range of arrays.

We have also implemented PeekSort and PowerSort from “Nearly-Optimal 
Mergesorts: Fast, Practical Sorting Methods That Optimally Adapt to 
Existing Runs” by J. Ian Munro and Sebastian Wild. cf:

     https://www.wild-inter.net/publications/munro-wild-2018

They use a number of comparisons similar to TimSort. Our implementations 
are copied directly from the implementation of the authors. There is a 
lot of room for improvement to make the runtime much slower but we 
haven't gotten to that yet.

PowerSort, in particular, is designed to be efficient in cache. We 
wondered whether it would adapt nicely to lists but it is also only 
future work.

> My work aims at being a replacement for OCaml sort (for list currently). Here is
> the timing I get:
> 
> Correctness test passed
> Stability test passed
> Random lists:
>            random: tf = 1.44, tg = 1.39, factor = 1.04x, gain = 3.70%
>      random small: tf = 1.10, tg = 1.19, factor = 0.92x, gain = -8.27%
> Worst cases:
>            worst1: tf = 0.83, tg = 0.71, factor = 1.17x, gain = 14.21%
>            worst2: tf = 0.65, tg = 0.70, factor = 0.93x, gain = -7.17%
> Sorted (partially) lists:
>            sorted: tf = 0.65, tg = 0.01, factor = 97.86x, gain = 98.98%
>          reversed: tf = 0.65, tg = 0.09, factor = 7.62x, gain = 86.88%
>        sorted@rev: tf = 0.65, tg = 0.21, factor = 3.14x, gain = 68.19%
>        rev@sorted: tf = 0.65, tg = 0.21, factor = 3.13x, gain = 68.03%
> Shuffled lists (permute k times 2 elements in a sorted list):
>        shuffle 10: tf = 0.66, tg = 0.41, factor = 1.61x, gain = 38.01%
>       shuffle 100: tf = 0.64, tg = 0.51, factor = 1.25x, gain = 20.28%
>      shuffle 1000: tf = 0.63, tg = 0.56, factor = 1.14x, gain = 11.94%
>     shuffle 10000: tf = 0.64, tg = 0.61, factor = 1.04x, gain = 4.30%
> 
> factor > 1 means Block_sort is faster that List.sort. Remark that in the case of
> sorted list, my algorithm returns the original list using constant memory.

I guess we had similar goals. I do think it can be nice to have a 
sorting algorithm in the standard library that behaves well on lists 
that have some order in them, as such lists occur quite often in real life.

We also obtained similar results when comparing to the standard library 
(for arrays): On purely random lists, our TimSort is ~20% slower and 
uses some more comparisons. As soon as the lists contain some order, 
TimSort beats the standard library by far both in term of runtime and 
comparisons.

We were planning on starting to work again on this project soon-ish 
(hopefully within the coming month), and in particular we wanted to 
optimise a bit more PeekSort and PowerSort and then package the whole 
thing in a library. We'll follow closely what you are doing so as to not 
clash with your work!

Cheers,
Niols

  reply index

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-09  4:05 Christophe Raffalli
2021-10-09  8:58 ` Mario Pereira
2021-10-09 19:59   ` Christophe Raffalli
2021-10-11 11:46     ` Niols [this message]
2021-10-11 21:27       ` Mario Pereira
2021-10-22  4:44 ` Kazuhiko Sakaguchi

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=875a3212-c4a1-41fb-65f7-2a68b012af9f@niols.fr \
    --to=niols@niols.fr \
    --cc=caml-list@inria.fr \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

caml-list - the Caml user's mailing list

Archives are clonable: git clone --mirror https://inbox.ocaml.org/caml-list

AGPL code for this site: git clone https://public-inbox.org/ public-inbox