Monday, January 13, 2020

Cons cells vs. Linked Lists

Cons cells and linked lists are the meat and potatoes of Lisp programming. Linked lists are the primary structure that everything operates on and cons cells are the Lego blocks they are made of. For an experienced Lisp programmer, cons cells just fade into the background. You know they are there as the glue holding everything together, but it is the linked list that you keep in mind. One could construct all sorts of weird trees, dags, and graphs out of cons cells, but in general you keep things in nice linear singly-linked lists terminated with a nice, full-stop NIL.

Cons cells are nearly the perfect concrete implementation of an abstract two-tuple. They are first-class objects: you can assign them to variables, stuff them in arrays, pass and return them as values, and check them for identity. They are orthogonal to other data types; only a cons-cell returns 't to consp. They are opaque — except for the defined operations of car and cdr, you cannot access the contents of a cons cell. And while they are usually implemented as adjacent memory locations, they hide their representation and there have been many Lisps that have used unusual concrete representations of cons cells like parallel arrays of the car and cdr parts or bit codes to omit the cdr altogether through “cdr coding”. All operations on cons cells can be reduced to the basic operations cons, consp, car, cdr, (setf car), and (setf cdr). (If we had immutable cons cells, we could even get rid of the last two, but then we'd want some other means for creating circular and semi-circular structure.*)
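
For concreteness, the entire protocol at a Scheme REPL looks something like this (Scheme spells consp, (setf car), and (setf cdr) as pair?, set-car!, and set-cdr!):

  (define c (cons 1 2))    ; construct a fresh cell
  (pair? c)                ; => #t -- only cons cells satisfy the predicate
  (pair? "not a cell")     ; => #f
  (car c)                  ; => 1
  (cdr c)                  ; => 2
  (set-car! c 10)          ; mutate the car...
  (set-cdr! c 20)          ; ...and the cdr
  c                        ; => (10 . 20)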

So I find it somewhat surprising that the standard linked list implementation in Lisp is just a terrible example of an abstract data type. This no doubt happened because linked lists got standardized well before abstract data types were really understood.

The big problem with linked lists is that instead of being orthogonal to other data types, they are a subdomain of cons cells. The representation of a singly linked list is completely exposed: it is a cons cell, without even a wrapper object to tell you whether you are dealing with the list itself or its representation. It is only by common convention that certain cons cell structures are considered to represent linked lists. And it isn't immediately clear whether the representation is meant to be a pointer to the first pair of the list, or to the entire “spine” of the list. It is often treated both ways. There is little distinction between a list primitive and a cons cell primitive, which usually doesn't get you into trouble, except in those few cases where it can cause major confusion, like when you have to handle “improper” or “dotted” lists.
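
A short Scheme transcript makes the exposure plain: the “list” you are handed is just the first cell of a chain, and only a convention about what sits in the final cdr separates a proper list from a dotted one.

  (define xs (list 1 2 3))      ; really (1 . (2 . (3 . ()))), printed as (1 2 3)
  (pair? xs)                    ; => #t -- there is no separate list type
  (list? xs)                    ; => #t -- list? just walks the spine looking for ()
  (list? (cons 1 (cons 2 3)))   ; => #f -- same kind of cells, but the chain ends
                                ;    in 3, so this is the “dotted” list (1 2 . 3)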

Lists are mutable because their representation is mutable and not hidden. It is possible to mutate the representation so that it no longer represents a list, “magically” changing any list that includes the mutated structure into something else. This means that either a lot of defensive copying must be done when lists are passed as arguments or returned as values, or an unenforced convention to avoid mutation of list structure must be developed in the Lisp culture. We've been pretty good at the latter, even documenting when you can and cannot rely on lists being mutated by library functions, but there are always a few people who go against the grain for the sake of “efficiency” (or plain orneriness) and write code that is impossible to use because you cannot easily tell what might be mutated behind your back.
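
A small example of the hazard: two lists that share a tail, where surgery on the shared structure silently changes both and leaves neither a proper list.

  (define tail (list 2 3))
  (define a (cons 1 tail))      ; a = (1 2 3)
  (define b (cons 0 tail))      ; b = (0 2 3), sharing the same tail
  (set-cdr! (cdr tail) 'x)      ; clobber the final cdr of the shared tail
  a                             ; => (1 2 3 . x) -- no longer a proper list
  b                             ; => (0 2 3 . x) -- “magically” changed as well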

With any abstract data type, there is conceptually a pair of functions used to transport objects across the abstraction barrier. One, call it abs->rep, takes an abstract object and exposes its representation. It is usually provided automatically by the compiler and called upon entry to the object's methods. In Java, for example, it establishes bindings for the this pointer and the private and protected fields of the object so that the method can use them. The complementary function, call it rep->abs, takes the representation of an object and hides it in an opaque, abstract version for clients of the object to use. The clients have no way to manipulate the representation of the object because they only have access to the opaque, abstract version. In Java, for example, the compiler does this automatically after object construction and when the this pointer is returned properly cast to the abstract data type. The this pointer and the private and protected fields of the object go out of scope and are no longer accessible.

These functions are usually provided by the compiler and often have no real implementation. The compiler simply ensures that the representation comes into scope when the method is called (conceptually calling abs->rep) and that it goes out of scope when the method returns (conceptually calling rep->abs). No actual code is generated or exists at run time. It's easy to forget this is happening because the compiler does all the work for you. You just toggle the little bit in your head about whether you are “inside” the object or “outside” the object. If you forget, you can just examine the lexical nesting to see whether the representation is in scope.

In Lisp, however, for a singly linked list, not only are these functions omitted, they are completely fictitious. It is only in the programmer's head that what was once considered a linked list is now to be considered a pointer to the head cell of the list (abs->rep), and it is only probably in the programmer's head that the reverse (rep->abs) happens on the way out. It doesn't matter much if he or she forgets this because the written code is the same either way. It only matters if he or she somewhere down the line uses a cons-cell operation where a list operation is what should actually be used. This can lead to common rookie mistakes like the following (demonstrated in the sketch after the list):
  • Using cons where list is wanted, yielding (1 . 2) where (1 2) is desired. (The “unwanted dot” problem.)
  • Using list where cons is wanted, yielding (1 (2)) where (1 2) is desired. (The “too many parentheses” problem.)
  • Confusion about whether ((1 2) 3 4) is meant to be a three-tuple of a list and two integers, or a two-tuple of two lists. (It's both, depending on the unwritten intent of the programmer.)
  • Using cons or list where append is wanted, yielding ((1 2) 3 4) or ((1 2) (3 4)) when (1 2 3 4) is desired. (Again, “too many parentheses”.)
  • Use of (append ... (list <element>)) to “cons” to the “right end” of a list, leading to O(n²) algorithms rather than O(n).
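
Here is a quick REPL demonstration of the mistakes above, followed by a sketch (the names iota-slow and iota-fast are just illustrative) of why consing on the right with append is quadratic while consing on the left and reversing once is linear:

  (cons 1 2)                      ; => (1 . 2)      unwanted dot; (list 1 2) was wanted
  (list 1 (list 2))               ; => (1 (2))      too many parentheses; (cons 1 (list 2)) was wanted
  (cons (list 1 2) (list 3 4))    ; => ((1 2) 3 4)
  (list (list 1 2) (list 3 4))    ; => ((1 2) (3 4))
  (append (list 1 2) (list 3 4))  ; => (1 2 3 4)    what was probably wanted

  ;; Appending to the right end re-walks the accumulated list on every
  ;; iteration, so this is O(n²):
  (define (iota-slow n)
    (let loop ((i 0) (acc '()))
      (if (= i n)
          acc
          (loop (+ i 1) (append acc (list i))))))

  ;; Consing onto the front and reversing once at the end is O(n):
  (define (iota-fast n)
    (let loop ((i 0) (acc '()))
      (if (= i n)
          (reverse acc)
          (loop (+ i 1) (cons i acc)))))
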
Now don't get me wrong. I like Lisp and I like linked lists. And I'm not suggesting we avoid using them in favor of some other well-designed abstract data type. I just think they're an awful example of how to implement an abstract data type, and perhaps that's why it is difficult for beginners to learn how to use them properly. It might also be worthwhile to implement a Lisp with proper (and immutable) abstract linked lists. It wouldn't make much difference to experienced programmers who are already used to applying the representation/abstraction interface in their heads, but it might make it easier for novices to manipulate linked lists and cons cells (and keep them apart).

If you want to be completely contrary, consider Olin Shivers's suggestion: all objects — cons cells, strings, integers, null, etc. — are lists. It's just that every object other than a cons cell is a zero-element dotted list. Now rather than being a subtype of cons cells, lists become a supertype of all objects. This viewpoint can probably be made coherent, but it does raise a lot of questions. Here are some that come to mind:
  • Is (length '(1 2 . 3)) the same as (length '(1 2 3))? If not, what is (length '(1 2 . 3))?
  • Should lists retain their “dottedness” when passed through functions like memq or map? What is (memq 2 '(1 2 . 3))? What about (memq 3 '(1 2 . 3))?
  • What is (reverse '(1 2 . 3))? Is (compose reverse reverse) an identity?
This was extensively discussed on the SRFI-1 mailing list, so I won't rehash the discussion here. The questions I raised above, and many more, were raised and discussed. Eventually, it was decided that continuing to be backwards compatible was an important consideration. (Personally, I think the notion plays havoc with the group-theoretic properties of lists, and that is enough to make it suspect.)

There is a good argument that “dotted” lists are rarely used and almost always a mistake, but they are built into the grammar of Scheme as an indicator of “rest” arguments, so getting rid of them would require some other way to specify “rest” arguments. Racket takes things further by allowing doubly dotted lists to indicate infix notation: (a . < . b).
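
For example, a dotted formals list is how Scheme collects “rest” arguments (make-labeled below is just an illustrative name), and Racket's doubly dotted reader syntax, which is not portable Scheme, turns a two-argument call into infix:

  (define (make-labeled label . rest)   ; rest is bound to a fresh proper list
    (cons label rest))
  (make-labeled 'point 1 2 3)           ; => (point 1 2 3)

  ;; Racket only: (x . op . y) is read as (op x y).
  (1 . < . 2)                           ; => #t, read as (< 1 2)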

Just for kicks, I took things in the other direction and wrote some C# code that implements singly-linked lists as their own abstract data type using special, immutable cons cells that require that their CDR be either an existing singly-linked list or the empty list. “Dotted” lists are not a problem because you simply cannot construct one. The representation of a list is explicitly coded as a pointer to the head cons cell of the list. The code illustrates how the abstract list is turned into the pointer to the cons cell when it is carried across the abstraction barrier and how it is turned back into an abstract list when carried back out. Again, I'm not suggesting anyone use the code, or take it as a serious proposal. (For one thing, it doesn't address what to do about circular lists, or the dotted lists in the Scheme grammar.) It was just a fun hack for illustrative purposes. It is available here for those interested.
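
The linked C# code is the real artifact; purely as a sketch of the same idea in Scheme (not the author's code, and glossing over plenty), the barrier might be an R7RS record type whose constructor and accessor play the roles of rep->abs and abs->rep:

  (define-record-type <slist>
    (rep->abs head)     ; representation -> abstract list (wrap the head cell)
    slist?
    (head abs->rep))    ; abstract list -> representation (unwrap it)

  (define the-empty-slist (rep->abs '()))

  (define (slist-cons element slist)
    ;; Cross the barrier, work on the representation, and re-wrap the result.
    ;; Because clients never see rep->abs or abs->rep, and slist-cons only
    ;; accepts an existing slist, a “dotted” slist cannot be constructed.
    (rep->abs (cons element (abs->rep slist))))

  (define (slist-first slist) (car (abs->rep slist)))
  (define (slist-rest  slist) (rep->abs (cdr (abs->rep slist))))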

*Many years back, Henry Baker said “C'mon, cons cells should just be immutable” (if I am remembering the exact quote correctly). I agree with his sentiment. Combine immutable cons cells with “hash consing” and the appropriate equality primitives and you get directed acyclic graphs (and their space properties) “for free”. We'd either have to do without circular structure or use another means to achieve it. Since circular structure often leads to divergent programs, I wouldn't consider it a great loss, but some may disagree. Perhaps they might be assuaged by a nice set of primitive procedures for creating and manipulating circular cons cell structure.
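
A hedged sketch of the hash-consing idea (the name hash-cons and the SRFI 69-style hash-table calls are illustrative assumptions, not any particular Lisp's API): a memoizing constructor returns the same cell for the same car and cdr, so equal structure is shared and eq? doubles as structural equality.

  (define *cons-table* (make-hash-table))        ; SRFI 69-style table, equal? keys

  (define (hash-cons kar kdr)
    ;; Return the canonical cell with this car and cdr, creating and
    ;; remembering it if it doesn't exist yet.  The cells are ordinary
    ;; pairs here; the scheme only works if they are treated as immutable.
    (let ((candidate (cons kar kdr)))
      (or (hash-table-ref/default *cons-table* candidate #f)
          (begin (hash-table-set! *cons-table* candidate candidate)
                 candidate))))

  (eq? (hash-cons 1 (hash-cons 2 '()))
       (hash-cons 1 (hash-cons 2 '())))          ; => #t -- one copy of (1 2) exists

A real implementation would use weak references and eq?-keyed lookup on the (already canonical) car and cdr, but the sketch shows the payoff.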

4 comments:

Joe Marshall said...

I'm told that Clojure implements linked lists as their own ADT with immutable cons cells. Good for them!

Byron Schlemmer said...

Enjoyed reading that. As a Lisp novice, it cleared up a few mysteries for me.

patrickdlogan said...

Immutable lists in Scheme: https://srfi.schemers.org/srfi-116/srfi-116.html

Also, Scheme and Common Lisp have immutable collections along the lines of Clojure's, e.g. FSet: https://common-lisp.net/project/fset/Site/FSet-CL.html

John Cowan said...

Fortunately for us all, most lists are immutable by convention in most Lisps. In Racket, mutable and immutable pairs are disjoint altogether, and the traditional pairs are the immutable ones, so set-car! and set-cdr! don't exist.

But SRFI 127 exploits mutability and improper lists to create an immutable-by-convention lazy sequence (lseq) type. The implementation strategy is that a proper list is a fully realized lseq, but an improper list has a (normally stateful) thunk in its tail that when invoked generates the next element, and this is the general case. So lseq-cdr, which is the heart of the library, works like this:

Fetch the cdr of the argument. If it's (), return it; if it's not a procedure, it's an error. But if it is a procedure, invoke it to get the next element and allocate a new pair whose car is the next element and whose cdr is the thunk. Then set the cdr of the argument to be the new pair, and the invariant is maintained but the lseq is one item longer. As a special case, if the thunk returns an end of file object, no new pair is allocated and the cdr of the argument is set to (), as the list is considered to be fully realized.
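
A rough Scheme sketch of that strategy (a paraphrase of the description above, not the SRFI 127 reference implementation):

  (define (lseq-cdr lseq)
    (let ((tail (cdr lseq)))
      (if (procedure? tail)
          (let ((next (tail)))                ; invoke the generator thunk
            (if (eof-object? next)
                (begin (set-cdr! lseq '())    ; fully realized: no new pair
                       '())
                (let ((new-pair (cons next tail)))
                  (set-cdr! lseq new-pair)    ; splice in the realized element
                  new-pair)))
          tail)))                             ; already realized: () or a pair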

Lseqs are like (deprecated) SRFI 40 streams: they are odd, in the sense that unless the lseq is an empty list, there must always be at least one realized element. But unlike SRFIs 40 and 41 (even streams), this library contains only procedures and requires no macros, and the realized elements are just values rather than promises.

It is the essence of this library that it's built on Lisp lists (which are an abstraction, though they don't have an abstraction barrier) and the library is much smaller and simpler because the abstraction it supplies doesn't have a barrier either. Procedures are included in the library if they can be implemented without examining all the elements. (There are a few exceptions for convenience, like lseq-length.)

For example, there is no lseq-reverse, because that could only be lseq-realize composed with reverse. But there is lseq-member, since that stops as soon as it finds an appropriate value and need not generate any more. Lseq-fold and lseq-for-each are provided because it is not uncommon to abort fold and for-each early with call/cc.