Thursday, July 18, 2013

Persisting CLOS objects

In a previous post, I described how a programmer would save simple primitive objects to the persistent store. How does a programmer save a CLOS object? Very simply. Here is a class definition:
(defclass test-class ()
  ((name :initarg :name
         :initform 'zippy
         :reader test-class/name)))
Here is the persistent version:
(defclass test-class ()
  ((name :initarg :name
         :initform 'zippy
         :reader test-class/name))
  (:metaclass persistent-standard-class)
  (:schema-version 0))
And here is how you save a persistent instance to the store:
(make-instance 'test-class)
I'll do another one:
(make-instance 'test-class :name 'griffy)
Not too shabby.

The point is that we can abstract away an awful lot of the persistence layer. This is really important because the versioning layer is at least as complex. Wrapping your mind around multiple versioned instances takes practice. It's a good thing that we don't have to worry about the persistence layer at the same time.

But I said that I'd describe how it works. I have several attempts at a description sitting here on my computer. They are hard to read and hard to understand, and they make it seem as though the scheme couldn't possibly work correctly. I've tried to argue logically that it does work, and certainly the fact that the code was working is empirical evidence, but I'm still trying to find a description clear enough that it simply makes sense that it ought to work. So rather than describe why it ought to work, let me describe what happens beneath the covers.

The code in pstore/pclass.lsp has the implementation. The CLOS meta-object protocol allows you to customize the behavior of the object system by adding your own methods to the internal CLOS implementation. To create a CLOS object, you call make-instance. Magic happens, but part of that magic involves initializing the slots of the newly created object. At this point during the object instantiation magic, CLOS calls the generic function shared-initialize. shared-initialize is responsible for assigning values to the slots of an object, and it gets called on the uninitialized object, the set of slot names to fill, and an argument list. The argument list is normally the same argument list given to make-instance. The default behavior of shared-initialize is to match up the keyword-specified initargs with the appropriate slots and stuff the values in. But we'll modify that.
(defmethod clos:shared-initialize ((instance persistent-standard-object) slot-names
                                   &rest initargs
                                   &key persistent-store node-id node-index
                                   &allow-other-keys)

  (if (eq instance *restoring-instance*)
      (call-next-method)
      ;; If we are being called from elsewhere,
      ;; we have to wrap the initargs and initforms
      ;; in persistent-objects and create an initializer
      ;; for this object.
      (let* ((class (class-of instance))

             (init-plist (compute-persistent-slot-initargs class
                                                           (or persistent-store  *default-persistent-store*)
                                                           initargs))

             (node-id (persistent-object/save
                       (make-initializer class
                                         (class-schema-version class)
                                         init-plist)
                       (or persistent-store  *default-persistent-store*)
                       node-id)))

        (apply #'call-next-method instance slot-names (nconc init-plist initargs))

        (setf (persistent-standard-object/node-id instance) node-id)
        (setf (persistent-standard-object/node-index instance) node-index)
        (setf (object-map-info/%cached-value
               (persistent-object/find-object-map-info
                (or persistent-store  *default-persistent-store*)  node-id))
              instance)

        instance)))
First, we check if the instance we are initializing is being restored from the persistent store. When we first open a persistent store and re-instantiate the objects, we do not want the act of re-instantiation to cause the objects to be re-persisted. So in that case we simply invoke call-next-method and let the default actions take place.

But if we are creating a new object, we want it to persist. The call to persistent-object/save does the trick, but notice that we don't pass in the instance. We call make-initializer on the argument list and we save that instead.

An initializer is a simple structure that holds the class, a "schema-version", and the argument list:
(defstruct (initializer
            (:conc-name initializer/)
            (:constructor make-initializer (class schema-version init-plist))
            (:copier nil)
            (:predicate initializer?))
  (class        nil :read-only t   :type persistent-standard-class)
  (schema-version 0 :read-only t   :type non-negative-fixnum)
  (init-plist   '() :read-only t   :type list))
and persistent-object/save serializes it like this:
(:method ((object initializer) stream symbol-table)
    (write-byte serialization-code/initializer stream)
    (write-fixnum (symbol-table/intern-symbol symbol-table (class-name (initializer/class object))) stream)
    (write-fixnum (initializer/schema-version object) stream)
    (write-fixnum (length (initializer/init-plist object)) stream)
    (iterate (((key value) (scan-plist (initializer/init-plist object))))
      (write-fixnum (symbol-table/intern-symbol symbol-table key) stream)
      (serialize value stream symbol-table)))
(I'm skipping over an important detail, but I'll get to it...)

Something unusual is going on here. The persistent object itself is not placed in the store. The argument list passed to make-instance is stored instead. Because the persistent object is immutable, all the information needed to reconstruct the object is present in the initargs, so we don't need the resulting object.
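To make the idea concrete, here is a minimal sketch in plain CLOS, with no metaclass and no real store. Every name in it (*toy-store*, *toy-restoring*, toy-persistent-object, toy-restore, toy-person) is made up for illustration and is not part of the pstore code. It also cuts two corners that the real code does not: it uses a simple flag rather than comparing against the instance being restored, and it records only the explicit initargs, not the initform values.

```lisp
;; Toy in-memory "store": a growable vector of (class-name . initargs) records.
(defvar *toy-store* (make-array 0 :adjustable t :fill-pointer 0))

;; True while an instance is being rebuilt from the store, so that
;; rebuilding does not persist the instance a second time.
(defvar *toy-restoring* nil)

(defclass toy-persistent-object ()
  ((node-id :accessor toy-node-id :initform nil)))

(defmethod shared-initialize :around ((instance toy-persistent-object) slot-names
                                      &rest initargs &key &allow-other-keys)
  (declare (ignorable slot-names))
  ;; Let the default method fill the slots from the initargs and initforms.
  (call-next-method)
  (unless *toy-restoring*
    ;; Save the argument list, not the instance.
    ;; VECTOR-PUSH-EXTEND returns the index of the new record,
    ;; which serves as the node id here.
    (setf (toy-node-id instance)
          (vector-push-extend (cons (class-name (class-of instance))
                                    (copy-list initargs))
                              *toy-store*)))
  instance)

(defun toy-restore (node-id)
  "Rebuild an instance from the record stored under NODE-ID."
  (destructuring-bind (name . initargs) (aref *toy-store* node-id)
    (let* ((*toy-restoring* t)
           (instance (apply #'make-instance name initargs)))
      (setf (toy-node-id instance) node-id)
      instance)))

(defclass toy-person (toy-persistent-object)
  ((name :initarg :name :initform 'zippy :reader toy-person/name)))
```

Calling (make-instance 'toy-person :name 'griffy) appends one record to *toy-store*, and toy-restore on that instance's node id builds an equivalent instance without appending another.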

Why would we do this? The object itself has structure. Instantiating the object imposes this structure on the values stored within. The structures of the objects in the store are collectively known as the schema. Persistent stores are intended to hold objects for a long time. We expect the code that manipulates the objects to change over time, and it is likely that we will want to change the object representation on occasion. When we change the object representation, we need to consider the legacy objects that were constructed under the old representation. This is called schema evolution and it is one of the most painful tasks in maintaining an object-oriented database. At its worst, the persistent schema is so different from the code schema that you have only one way to handle the schema change: dump the entire database into a neutral format (like a file full of strings!), create a new, empty database and read it all back in. My experience with other object-oriented databases is that the worst case is the common case.

If we store only the information needed to reconstruct the object, we no longer need to worry about the object layout. This finesses the problem of schema evolution.

But there is a :schema-version specified in the class definition, and that is most definitely stored. There are two kinds of information in the initargs: the values themselves are obvious, but the interpretation of the values is not. An example should illustrate this.

Suppose we start out a project where we are going to save named objects in the store. At some point in the code we invoke (make-instance 'foo :name "Joe") and so there is an initializer in the store something like [foo (:name "Joe")].

Now suppose that we extend our application. We are going to store family names as well. So we start storing initializers with more data: [foo (:name "John" :family "Smith")]. What do we do about the legacy [foo (:name "Joe")]? Let us suppose we decide that we'll just default the missing family name to "Unknown". Everything is cool. Old and new objects live together.

But now we want to extend our application to handle people like Cher and Madonna. We want it to be the case that we can deliberately omit the family name for some people. The initializers will look like [foo (:name "Cher")]. But now we have an ambiguity. We don't know if the family name is omitted on purpose, or whether the object was stored before the family name became important. Do we default the last name to "Unknown" or not?

The :schema-version argument in the class definition is used to disambiguate these cases. When the objects are recovered from the store, the constructor can use this value to decide how to interpret the remainder of the initargs.

Admittedly, this is a bit klunky. But it doesn't complicate things too much. Programmers will have to do two things when changing a persistent class definition: bump the :schema-version, and decide how to reconstruct objects that were stored under the legacy expectations. (Actually, you can punt on these if you can prove that no ambiguous cases will arise.)
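As a rough sketch of how the Cher example might play out, suppose records written before family names existed carry schema version 0, and the current schema is version 2. The generic function toy-restore-instance and the class toy-foo below are hypothetical stand-ins, and the version numbers are my assumption. The real pstore::restore-instance takes more arguments.

```lisp
(defclass toy-foo ()
  ((name   :initarg :name   :reader toy-foo/name)
   ;; Under the current schema a missing family name is deliberate.
   (family :initarg :family :initform nil :reader toy-foo/family)))

(defgeneric toy-restore-instance (class-sym schema-version init-plist)
  (:documentation "Rebuild an instance from a stored init-plist."))

;; Default case: the record can be interpreted as is.
(defmethod toy-restore-instance ((class-sym (eql 'toy-foo)) schema-version init-plist)
  (declare (ignore schema-version))
  (apply #'make-instance class-sym init-plist))

;; Version 0 records predate family names, so a missing :family
;; means "we never asked", not "this person has none".
(defmethod toy-restore-instance ((class-sym (eql 'toy-foo)) (schema-version (eql 0))
                                 init-plist)
  (toy-restore-instance class-sym 2 (list* :family "Unknown" init-plist)))
```

With these definitions, a version 0 record (:name "Joe") comes back with the family name "Unknown", while a version 2 record (:name "Cher") comes back with no family name at all.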

Now about that important detail. The initializers we store aren't exactly what we said. Instead, when the persistent class is defined, a set of "hidden slots" is created in parallel with the declared slots. The initargs of the hidden slots are not persistent objects, but the persistent object ids of the initargs. We don't store [foo (:name "Joe")], we store [foo (:persistent-initarg-for-name 33)] where 33 is the persistent object id of the persistent string "Joe". I could write a few pages explaining why, but it would be deadly boring. I'm sure you can imagine uses for an extra hidden level of indirection (think multi-value concurrency). (By the way, notice that the call to (apply #'call-next-method ...) uses nconc to paste the hidden arguments on the front of the argument list, like I mentioned in the previous post.)
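The indirection can be pictured with a toy object table. This is only a sketch: *toy-objects*, toy-save, toy-fetch, toy-indirect-plist and toy-resolve-plist are invented names, and the renaming of each key to a hidden :persistent-initarg-for-... key is left out to keep it short.

```lisp
;; Toy object table: the index of a value in this vector is its object id.
(defvar *toy-objects* (make-array 0 :adjustable t :fill-pointer 0))

(defun toy-save (value)
  "Store VALUE and return its object id."
  (vector-push-extend value *toy-objects*))

(defun toy-fetch (id)
  "Return the value stored under object id ID."
  (aref *toy-objects* id))

(defun toy-indirect-plist (initargs)
  "Return a plist like INITARGS with each value replaced by its object id."
  (loop for (key value) on initargs by #'cddr
        collect key
        collect (toy-save value)))

(defun toy-resolve-plist (id-plist)
  "Return a plist like ID-PLIST with each object id replaced by its value."
  (loop for (key id) on id-plist by #'cddr
        collect key
        collect (toy-fetch id)))
```

Here (toy-indirect-plist '(:name "Joe")) yields a plist whose :name entry is an integer id, and toy-resolve-plist turns it back into (:name "Joe").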

Does it work? Mostly. If you look at the code in conman/workspace.lsp you'll find a class with a schema-version of 1 and this method:
(defmethod pstore::restore-instance ((class (eql (find-class 'workspace))) (schema (eql 0)) 
                                     persistent-store node-id node-index init-plist)
  (debug-message 2 "Upgrading schema for workspace.")
  ;; This needs work.  The zeros are the OID of NIL.
  (pstore::restore-instance class 1 persistent-store node-id node-index
                    (list* :added-master-csets 0
                           :removed-master-csets 0
                           :transitional-added-master-csets 0
                           :transitional-removed-master-csets 0
                           init-plist)))
I added four slots to workspace objects. When restoring a workspace from the store, if it was a workspace created before these slots existed, this method overrides the usual restore method. It simply adds the new slots to the front of the init-plist before proceeding with the normal restore-instance. (The use of the number 0 instead of NIL is an implementation defect that I'm too lazy to fix at the moment.)

The problem in explaining this? I don't know an easy proof that storing initializers rather than objects is sufficient in all cases. It's not obvious that this even helps with schema evolution, and it took me a while before I was persuaded that there aren't lurking edge cases. In personal discussions, it takes a while to persuade people that this is in fact a solution to a problem. I'd love to hear a better argument.

8 comments:

John Cowan said...

I don't understand what you think would count as a proof in this context. The arguments passed to `make-instance` are the sum total of the information needed to create the instance in its initial state. Since you require that instances be transitively immutable, the initial state is the only state. If you have those arguments squirreled away, then you can create another instance in the exact same initial state. Q.E.D.

Unknown said...

" The arguments passed to `make-instance` are the sum total of the information needed to create the instance in its initial state"

Well, not necessarily. What about any stateful behavior encoded into the object's class? For instance, maybe the class maintains an atomic counter for assigning instance IDs, or slots with allocation other than that of instance (or persistent, since they are handled by his technique).

By the way, couldn't it just have trapped the out-of-date schema version in shared-initialize and dispatched it off to update-instance-for-redefined-class?

John Cowan said...

Fair enough: however, I would assume that the insistence on transitive immutability extends to the class as well as the slots.

The issue with schema evolution is that you don't always know how to migrate an instance from one (version of a) class to another, arbitrary (version of a) class while preserving the semantics, if preserving the semantics is even possible. Anything at all might have changed, so in the general case `update-instance-for-redefined-class` will not save you.

Joe Marshall said...

I would assume that the insistence on transitive immutability extends to the class as well as the slots.

That effectively disallows schema changes.

Simon Leinen said...

Shouldn't the instantiation examples read "(make-instance 'test-class ...)" rather than "(make-class ...)"?

Joe Marshall said...

Thanks, Mr. Leinen,
You are right. Typo fixed.

I ought to copy and paste from a listener, but I'm lazy.

Alastair Bridgewater said...

I'd like to report a bug in your example for pstore::restore-instance. The particular use of CALL-NEXT-METHOD "should signal an error of type ERROR", as per CLHS.

In safe code, this is required to signal an error. In unsafe code, if this does not signal an error then you are in undefined consequences territory.

Joe Marshall said...

Alastair Bridgewater noticed "The particular use of CALL-NEXT-METHOD "should signal an error of type ERROR"".

Yep. That's a bug. Fixed. Thanks!