Friday, August 16, 2013

Plus ça change

Let's start with a list:
'(a b c)
We change it:
'(a b c d)
We change the result:
'(0 a b c d)
etc.
'(0 a b c)
etc.
'(a b c)
and now a big change:
'(-1 0 1 a a1 b b1 c d e f)
At each step we compute the indels for that step. If we so desire, we can reconstruct any of the intermediate lists by starting at the beginning and applying the appropriate indels in order.

But what if we skip some? We end up with a mess. Each set of indels is computed relative to the application
of the prior indels. If an indel is omitted, the indices of the subsequent indels will be wrong.

To solve this, we have to change the representation of our sequence (that is, we won't be modifying a list).
Instead, we'll represent our sequence as an ordered set of list segments that we'll concatenate. Insertion is easy — just add a segment. Deletion is slightly harder because we don't want to cross segment boundaries.

Reconstruction of an intermediate list requires more work, but we gain flexibility. We can apply a subset of the insertions and deletions and still get a meaningful result. For example, the first change we made was to create a list with the three elements '(a b c). The second change appended a 'd'. What if we apply the second change but omit the first? We append a 'd' to an empty sequence and get '(d).

How about if we apply only change 3? That inserts a '0' at the beginning giving us '(0).

If we apply change 2 and 3, still omitting 1, we get '(0 d).

That last change is tricky. We've deleted the 'a and prepended '(-1 0 1 a a1), and deleted the 'c and inserted '(b1 c d e f). If we omit change 4 and 5 (which delete the leading 0 and trailing 'd) we'll get '(-1 0 1 a a1 0 b b1 c d e f d). We preserve the order between different insertion points, so the inserted 0 is always before the inserted 'd, but we resolve ambiguous insertions by placing the later insertions before the earlier ones if they are in the same place.

Optional Exercise 7 (quite hard): Implement this as well.

1 comment:

John Cowan said...

I'd change the representation of indels to follow David Durand's Palimpsest design. In this system, rather than using absolute positions for inserts and absolute ranges (pairs of positions) for deletes that are relative to a specific state, positions are labeled as (n, k), meaning "the nth position of the text inserted by the indel whose unique ID is k". This means, for example, that if k1 is an indel representing an insertion into indel k2, and k2 has not been applied, then k1 is a no-op.

In Palimpsest, things are somewhat more complicated, because there are four operations: insert, delete, copy, and move. Consequently, positions must be represented as (n, k1, k2, ... kn) meaning "the nth position of insert indel k1, as copied by copy indels k2 ... kn." In this way, an insert into copied text winds up in both places.