Friday, October 18, 2013

Unexciting code

Source code control isn't supposed to be interesting or exciting for the user. Blogging about source code control isn't going to be interesting or surprising for the reader, either. I'll spare you the walkthrough. You're old enough to figure it out without me holding your hand. It'll be more interesting if I blog about the problems and bugs we encountered.

Conman was the configuration manager built on the core ChangeSafe code. The master catalog in conman holds the highest level objects being managed.
(defclass master-catalog ()
  ;; subsystems and products (PC's) in this repository
  ((products   :initform nil
               :version-technique :composite-set
               :accessor master-catalog/products)

   (subsystems :initform nil
               :version-technique :composite-set
               :accessor master-catalog/subsystems)

   ;; All satellite repositories known to this master.  We don't allow
   ;; satellite repository names to change, though the projects they
   ;; contain may be renamed.  **WARNING** BE CAREFUL IF YOU OPERATE
   ;; in non-latest metaversion views of this list or you might try to
   ;; create a satellite which already exists.  Only update this list
   ;; using the latest metaversion.
   (satellite-repositories :initform nil :version-technique :composite-set)

   ;; Projects (a.k.a. ChangeSafe "classes") contained in satellite
   ;; repositories.  The descriptors contain the key mapping from a
   ;; project name to a satellite-repository-name.  We could almost
   ;; make this just a (project-name . satellite-name) mapping, but we
   ;; need to version project-name, and we also want to cache in the
   ;; master some information in the satellites so that we don't
   ;; always have to examine the satellites for often accessed
   ;; information.
   (classes :accessor master-catalog/classes
            :initform nil
            :version-technique :composite-set)

   ;; Cset-relationship-tuples is conceptually a list of sublists,
   ;; where each sublist is a tuple.  For every master cid which
   ;; results in the creation of satellite cids, a tuple is added
   ;; which enumerates the master cid and the satellite cids which it
   ;; caused to be created.  e.g. '((master.cid.1 satellite-1.cid.1
   ;; satellite-2.cid.1)) Because we want portable references, blah
   ;; blah blah, we actually reference DIDS of CHANGE-SET objects
   ;; rather than the cids.  We may instead wish to store CID-OBJECT
   ;; references.  TBD.

   ;; Right now, this information is maintained only for change
   ;; transactions which arise from WITH-CM-MASTER-TXN and
   ;; WITH-CM-SATELLITE-TXN.  This is ok, since those are the
   ;; interesting txns which manipulate satellite databases.

   ;; Note that because of the high volume of csets we expect to
   ;; track, we actually represent this information as a vector of
   ;; vectors to achieve space compaction.
   (cset-relationship-tuples :initform (make-instance 'persistent-vector
                                                           :initial-element nil
                                                           :size 1)
                             :version-technique :nonversioned)

   (cset-rel-tuples-index :initform (make-instance 'persistent-vector
                                                        :initial-element -1
                                                        :size 1)
                          :version-technique :nonversioned)

   ;; BOTH these slots are updated ONLY by vm-txn-note-change-set,
   ;; except for schema upgrading.

   ;; The cset-rel-tuples-index slot is a conceptual hash table into the
   ;; cset-relationship-tuples slot. This is used by
   ;; master-catalog-lookup-cset-relationship
   ;; to avoid an extremely costly linear search of cset-relationship-tuples.
   ;; This is very important for cset_add, cset_remove, and csets_from_file.

   ;; The idea is that the did-string of the master-cid's did is hashed.
   ;; Reducing that hash modulo the number of entries in cset-rel-tuples-index,
   ;; finds a "home" index of cset-rel-tuples-index. Using the sb32 value
   ;; in that element, we either have a -1 (so the entry is not in the
   ;; hash table) or we get an index into cset_relationship_tuples.
   ;; If there is no hash collision, that element of cset_relationship_tuples
   ;; will contain the desired master-cid did we are looking for. If it
   ;; isn't the one we want, we have had a hash collision, and we resolve it
   ;; by linear probing in the next-door (circularly) element of
   ;; cset-rel-tuples-index.
   ;; The number of elements of cset-rel-tuples-index is always a prime number,
   ;; and is also maintained to be more than twice as large as the number of
   ;; entries in cset-relationship-tuples. That is important, to prevent
   ;; clustering and slow searching. So when it grows, cset-rel-tuples-index
   ;; grows by a reasonable factor (about 2) so that it always contains
   ;; at least half "holes", that is, -1.  Further, we want to avoid frequent
   ;; growth, because growing requires computing every entry in the hash table
   ;; again. That makes for a big transaction, as every element of the
   ;; cid-relationship-tuple vector has to be mapped in, and rehashed with
   ;; the new size of cset-rel-tuples-index.
   ;; Space considerations: In Jack's db, there are roughly 40,000 elements
   ;; currently in the cset-relationship-tuples.  Suppose we had 100,000
   ;; elements. In Jack's db, it appears that the tuples are about 2 elements
   ;; each, average. Suppose it were 9. Then the tuples would take 4*(1+9)=40
   ;; bytes each, so 40*100,000 = 4Mb total (plus another 400,000 for the
   ;; cset-relationship-tuples vector itself).  This is large, but not likely
   ;; to be a cause of breakage anytime soon.

   ;; The cset-name-hashtable maps cset names to csets.  While the HP
   ;; model of ChangeSafe doesn't allow changing the name of a cset,
   ;; we allow this in general.  So this hash table is keyed by cset
   ;; name, and valued by all csets which EVER bore that name in their
   ;; versioned name component.  The hash value is therefore a list.
   ;; In the case of system augmented names (by
   ;; change_create/master_change), there shouldn't be any collisions.
   ;; We also use this slot to hash unaugmented user names to csets,
   ;; and those are far more likely to have collisions (one key ->
   ;; multiple csets).  In the case of un-augmented names, this is
   ;; expected. In the case of augmented names, this is an error.
   (cset-name-hashtable :version-technique nil
                        :initform (make-instance 'persistent-hash-table :size 1023)
                        :reader master-catalog/cset-name-hashtable)
  (:documentation "Catalog/hierarchy-root of versioned information maintained in the master repository.")
  (:metaclass versioned-standard-class)
  (:schema-version 0))
Pretty straightforward, no? No? Let's list the products in the master catalog:
(defun cmctl/list-products (conman-request)
  "cheesy hack to list the products"
  (let ((master-repository-name (conman-request/repository-dbpath conman-request))
        (reason "list products")
        (userid (conman-request/user-name conman-request)))
     :master-metaversion :latest-metaversion
     :version :latest-version
     :reason reason
     :transaction-type :read-only
     :receiver (lambda (master-repository master-transaction master-catalog)
                 (declare (ignore master-repository master-transaction))
                 (collect 'list
                          (map-fn 't (lambda (product)
                                       (list (distributed-object-identifier product)
                                             (named-object/name product)
                                             (described-object/description product)))
                                  (master-catalog/scan-products master-catalog)))))))
That isn't very edifying.

Start from the end:
(master-catalog/scan-products master-catalog)
defined as
(defun master-catalog/scan-products (master-catalog)
  (declare (optimizable-series-function))
  (versioned-object/scan-composite-versioned-slot master-catalog 'products))
The optimizable-series-function declaration indicates that we are using Richard Waters's series package. This allows us to write functions that can be assembled into an efficient pipeline for iterating over a collection of objects. This code:
(collect 'list
   (map-fn 't (lambda (product)
                (list (distributed-object-identifier product)
                      (named-object/name product)
                      (described-object/description product)))
           (master-catalog/scan-products master-catalog)))
takes each product in turn, creates a three element list of the identifier, the project name, and the product description, and finally collects the three-tuples in a list to be returned to the caller. Here is what it looks like macroexpanded:
    (SETQ #:OUT-914 'PRODUCTS)
    (COMMON-LISP:LET (#:OUT-913 #:OUT-912)
      (SETQ #:OUT-913 (SLOT-VALUE-UNVERSIONED #:OUT-917 #:OUT-914))
      (SETQ #:OUT-912
                   (DEBUG-MESSAGE 4 "Using override cid-set to scan slot ~s"
      (COMMON-LISP:LET (#:OUT-911 #:OUT-910 #:OUT-909)
        (MULTIPLE-VALUE-SETQ (#:OUT-911 #:OUT-910 #:OUT-909)
          (CVI-GET-ION-VECTOR-AND-INDEX #:OUT-913 #:OUT-912))
        (IF #:OUT-911
             (IF (COMMON-LISP:LET ((#:G717-902 #:OUT-910))
                   (IF #:G717-902
                       (THE T (CVI/DEFAULT-ALLOWED #:OUT-913))))
                  (SLOT-UNBOUND (CLASS-OF #:OUT-917) #:OUT-917 #:OUT-914)))))
        (DEBUG-MESSAGE 5 "Active ion vector for retrieval:  ~s" #:OUT-911)
        (COMMON-LISP:LET (#:OUT-908)
          (SETQ #:OUT-908
                  (IF #:OUT-911
                      (THE T #())))
          (COMMON-LISP:LET (#:ELEMENTS-907
                            (#:LIMIT-905 (ARRAY-TOTAL-SIZE #:OUT-908))
                            (#:INDEX-904 -1)
                            (#:INDEX-903 (- -1 1))
                            (#:LASTCONS-897 (LIST NIL))
                     (TYPE SERIES::-VECTOR-INDEX+ #:INDEX-904)
                     (TYPE INTEGER #:INDEX-903)
                     (TYPE CONS #:LASTCONS-897)
                     (TYPE LIST #:LST-898))
            (SETQ #:LST-898 #:LASTCONS-897)
              (INCF #:INDEX-904)
               (IF (= #:INDEX-904 #:LIMIT-905)
                   (GO SERIES::END))
               (SETQ #:ELEMENTS-907
                       (ROW-MAJOR-AREF #:OUT-908
                                       (THE SERIES::VECTOR-INDEX
              (INCF #:INDEX-903)
              (IF (MINUSP #:INDEX-903)
                  (GO #:LL-918))
              (SETQ #:ITEMS-915
                      ((LAMBDA (ION-SOUGHT)
                          (SVREF #:OUT-909 ION-SOUGHT) ION-SOUGHT))
              (SETQ #:ITEMS-900
                      ((LAMBDA (PRODUCT)
                               (NAMED-OBJECT/NAME PRODUCT)
                               (DESCRIBED-OBJECT/DESCRIPTION PRODUCT)))
              (SETQ #:LASTCONS-897
                      (SETF (CDR #:LASTCONS-897) (CONS #:ITEMS-900 NIL)))
              (GO #:LL-918)
            (CDR #:LST-898)))))))
To be continued...

Sunday, October 13, 2013

Next step

We build a simple project/branch/version hierarchical abstraction, and we implement it on top of the core layer.

  • a project is a collection of branches that represent alternative ways the state evolves. Every project has a main branch.
  • A branch is a collection of versions that represent the evolution of the branch state over time.
  • A version is a collection of change-sets with some trappings.
  • A change-set is our unit of change.
When we do a read/write transaction, we'll add a new change set to the repository. Read-only transactions will specify a version for reading the core objects. Or two versions for generating diffs. Or maybe even three versions for a three-way merge.

Under this development model, developers will sync their workspace with the most chronologically recent version of the main branch. Their new change sets will be visible only to them, unless (until, we hope) they are "promoted" into the development branch.

We wanted to encourage frequent incremental check-ins. We wanted hook this up to Emacs autosave. Frequent check-ins of much smaller diffs breaks even.

Once you get to the point where your code passes all the tests, you promote it into the branch so everyone can use it. Every now and then the admins will "fork a relase", do some "cherrypicks" and make a release branch in the project.

Once you get to the point where your code passes all the tests, you promote it into the branch so everyone can use it. Every now and then the admins will "fork a relase", do some "cherrypicks" and make a release branch in the project.

The core code does all the heavy lifting, this is window dressing. The style: minimalist. Skin the change model.

This is turning into the code equivalent of a power-point lecture. You can read it yourself, I'm not going to walk you through it.

So that's it? Is that all there is? Many source code or change management systems do something more or less similar to this with files and directory trees instead of CLOS objects. Cute hack, but why all the fuss? Hasn't this been done?

If it seems obvious to you how we'd implement some of the usual source code control operations, good. We don't have to train you.

I wont insult your intelligence explaining in detail how to do a rollback by removing a change-set from a version. Use SETF, of course.

I hope nobody notices the 300 lb chicken sitting next to that shady-looking egg in the corner.

And something for philosopher/implementors to worry about: If I demote the change-set that instantiates an object, where does the object go? Is it possible create a reference to an uninstantiated object? What happens if you try to follow it?

Monday, September 23, 2013

Putting it all together

A ChangeSafe repository is implemented as a transient wrapper object around a persistent object. The wrapper object caches some immutable metadata. You'd hate to have to run a transaction in the middle of the print function in order to print the repository name. The wrapper also contains metadata associated with the backing store that the repository is using.

Oh yeah, there is something interesting going on in the wrapper, We keep track of the ongoing transactions by mapping the user-id to a list of transaction contexts (every nested transaction by a user "pushes" a new txn-context).

Anyway, it's the repository-persistent-information that has the interesting stuff:
(defclass repository-persistent-information ()
   (type  :initarg :type
          :initform (error "Required initarg :type omitted.")
          :reader repository-persistent-information/type
          :type repository-type)

   ;; Database parent is the root extent for an extent database, or the master database for a satellite.
   ;; Root extents or master repositories won't have a parent
   (parent-repository :initarg :parent-repository
                      :initform nil
                      :reader repository-persistent-information/parent
                      :type (optional relative-pathname))

   ;; Satellite repositories is non-nil only for master repositories.
   (satellite-repositories :initform nil
                           :initarg :satellite-repositories
                           :accessor repository-persistent-information/satellite-repositories)

   (canonical-class-dictionary :initform (make-instance 'canonical-class-dictionary)
                               :reader repository-persistent-information/canonical-class-dictionary)
   (cid-master-table :initform (make-instance 'cid-master-table)
                     :reader repository-persistent-information/cid-master-table)
   (root-mapper  :initarg :root-mapper
                 :initform (error "Required initarg :root-mapper omitted.")
                 :reader repository-persistent-information/root-mapper)
   (cid-mapper   :initarg :cid-mapper
                 :initform (error "Required initarg :cid-mapper omitted.")
                 :reader repository-persistent-information/cid-mapper)
   (local-mapper :initarg :local-mapper
                 :initform (error "Required initarg :local-mapper omitted.")
                 :reader repository-persistent-information/local-mapper)
   (locally-named-roots :initarg :locally-named-roots
                        :initform (error "Required initarg :locally-named-roots omitted.")
                        :reader repository-persistent-information/locally-named-roots)
   (anonymous-user :initarg :anonymous-user
                   :initform nil
                   :reader repository-persistent-information/anonymous-user))
  (:default-initargs :node-id +object-id-of-root+)  ;; force this to always be the root object.
  (:documentation "Persistent information describing a repositiory, and stored in the repository")
  (:metaclass persistent-standard-class)
  (:schema-version 0))

The repository-type is just a keyword:
(defconstant *repository-types* '(:basic :master :satellite :transport :extent :workspace)
  "Type of repositories.  Note that all but :EXTENT types of repositories
   serve as root extents for databases which have multiple extents, and therefore imply extent.")
The parent-repository and the
satellite-repositories are for juggling multiple "satellite" repositories for holding particular subsets of changes (for, say, geographically distributing the servers for different product groups).

The canonical-class-dictionary is an intern table for objects.

The cid-master-table is (logically) the collection of audit-records. A CID (after change id) is represented as an integer index into the master table.

The root-mapper is a mapping table from distributed identifiers to objects.

The cid-mapper is a mapping table from the distributed identifier that represents the CID to the integer index of that CID in the master table. It is a subtable of the local mapper.

The local-mapper is submapping of the root-mapping, but a supermapping of the cid-mapper.

The locally-named-rootsis a hash table for storing the root objects of the repository.

Finally, there is the anonymous-user slot, which is the user id assigned for bootstrapping.

And all this crap is in support of this procedure:
(defun call-with-repository-transaction (&key repository 

                                              ;; generally, you only want to specify these two
                                              ;; but if you are doing a comparison,
                                              ;; specify these as well

  (check-type user-id-specifier (or keyword distributed-identifier))
  (check-type transaction-type repository-transaction-type)
  (check-type reason string)
  ;; implementation omitted for brevity, ha ha

Naturally we need to specify the :repository, the :transaction-type is one of
(defconstant *repository-transaction-types* '(:read-only
The :user-id-specifier should be a distributed-identifier of a core-user instance.

The :reason is a human readable string describing the transaction.

The :meta-cid-set-specifier is mumble, mumble... just a sec...

The :cid-set-specifier is how you specify which CIDs will form the basis view for the transaction. We allow this to be a procedure that returns a cid-set object, and we will call this procedure as we are setting up the transaction and use the :meta-cid-set-specifier to specify the CIDs to form the versioned view the procedure will see.

The :meta-cid-set-specifier can be the symbol :latest-metaversion, a timestamp, or a cid-set. :latest-metaversion means to use all CIDS while resolving the :cid-set-specifier, a timestamp is useful for rewinding the world, and the main use for using an explicit cid-set is for synchronizing views between master and satellite repositories.

The :receiver is invoked within the dynamic extent of a transaction. It is passed a core-txn object that contains the metadata associated with the transaction.

The ChangeSafe core components are the repository that holds changes and associated meta-information, and simple versioned CLOS objects. It is only useful as a foundation layer, though.

Next up, another level of abstraction...

Tuesday, September 10, 2013

Mix in a little CLOS

The obvious idea here is to make CLOS objects where the slots are implemented as versioned value objects. Then we override slot-value-using-class. You might consider this a stupid CLOS trick. You could just as well establish an abstraction layer through other means, but the point is to create an understandable abstraction model. It is easy to understand what is going to happen if we override slot-value-using-class.

We use the MOP to create a new kind of slot so that we can compose values on the fly when the programmer calls slot-value-using-class. We also override (setf slot-value-using-class) so that it calls the "diff" computing code. Again, the point is to make it easy to understand what is happening.

The end result is the versioned-standard-object. An instance of a versioned-standard-object (or any of it's inheritors, naturally), has all its slots implemented versioned value objects. The programmer should specify versioned-standard-class as the metaclass in the class definition.

(defclass test-class ()
  ((nvi-slot  :version-technique :nonlogged
              :accessor test-class/nvi-slot)
   (lnvi-slot :version-technique :logged
              :accessor test-class/lnvi-slot)
   (svi-slot  :version-technique :scalar
              :accessor test-class/svi-slot))
  (:metaclass versioned-standard-class)
  (:schema-version 0))
In this example, the test class has some of the different kinds of versioned values that are named by the version technique. A :nonlogged slot is the "escape mechanism". It's a fancy name for "Just turn off the versioning, and use this here value."

A :logged slot is less drastic. There's no versioning behavior, it's just a persistent slot, but we'll keep a list of the transactions that modified it.

Finally, the :scalar version technique is one where the last chronologically participating change has the value.

A versioned slot using the :composite-sequence uses a set of diffs to represent the versioned slot value, and these are composed as described in an earlier post.
(defclass test-cvi-class ()
  ((cvi-slot-a :version-technique :composite-sequence
               :accessor test-cvi-class/cvi-slot-a)
   (cvi-slot-b :version-technique :composite-sequence)
   (cvi-slot-c :version-technique :composite-sequence :initform nil))
  (:metaclass versioned-standard-class)
  (:schema-version 0))

Once this is working, we have what we need to bootstrap the rest of the versioned storage.

Monday, September 2, 2013

Putting things together

Ok, so we have audit records, a persistent store, and "diffs". Let's start putting them together. Naturally, we are going to keep the audit records in the persistent store, and we'll put the diffs in the audit records.

A versioned value is the abstraction we're aiming for. We're going to create a versioned value by combining the information held in the audit records. If the information is a set of insertion and deletion records, we combine them as I described in the previous posts.

What makes this interesting is that we can specify a subset of the audit records to participate in the construction of the value. We can extract the versioned value as it appeared at any point in time by specfying only those records that have a timestamp at or before that point. We can also synthesize interesting views by omitting some of the records.

We're going to store a lot of these versioned values and we'll use many of them every time we access the store. To get any kind of coherent view of the world, we want to use a single set of audit records when we view these values. But programmers, being who they are, won't want to think about this. So here's what we'll do: we already have transactions in order to talk to the store; we'll add a field to the transaction that specifies the audit records to be used during that transaction. Pretty simple. You want to look at the world as it was on July 4th, you start the transaction with those audit records dated July 4th or earlier and use that set for every versioned value that you want to look at. It would be crazy to look at some objects as if it were July 4th, but others as if it were December. (Heh, but on occasion....)

There is another reason we want to specify a set of audit records at the beginning of a transaction: we need to know the baseline that we compute our diffs against. When we do a read/write transaction we're going to modify some of our versioned values. When the transaction ends, we need to compute the diffs of the things we modified. We compute the diffs relative to the view we established at the beginning of the transaction.

So we need to modify our audit records to record the basis set of records we use when we begin a transaction. We modify the transaction API to require the programmer to specify which basis set of records are to be used for the transaction and we use that basis set for computing the diffs at the end of read/write transactions.

There is an interesting side effect of this change. Suppose we have some way of attaching a label to the transactions, and some transactions only use label 'A' and others only use label 'B'. Further transaction using label 'A' only see diffs relative to prior 'A' versions, while the 'B' transactions only see the 'B' diffs. The result is that a single versioned value can hold two completely different histories.

Friday, August 16, 2013

Plus ├ža change

Let's start with a list:
'(a b c)
We change it:
'(a b c d)
We change the result:
'(0 a b c d)
'(0 a b c)
'(a b c)
and now a big change:
'(-1 0 1 a a1 b b1 c d e f)
At each step we compute the indels for that step. If we so desire, we can reconstruct any of the intermediate lists by starting at the beginning and applying the appropriate indels in order.

But what if we skip some? We end up with a mess. Each set of indels is computed relative to the application
of the prior indels. If an indel is omitted, the indices of the subsequent indels will be wrong.

To solve this, we have to change the representation of our sequence (that is, we won't be modifying a list).
Instead, we'll represent our sequence as an ordered set of list segments that we'll concatenate. Insertion is easy — just add a segment. Deletion is slightly harder because we don't want to cross segment boundaries.

Reconstruction of an intermediate list requires more work, but we gain flexibility. We can apply a subset of the insertions and deletions and still get a meaningful result. For example, the first change we made was to create a list with the three elements '(a b c). The second change appended a 'd'. What if we apply the second change but omit the first? We append a 'd' to an empty sequence and get '(d).

How about if we apply only change 3? That inserts a '0' at the beginning giving us '(0).

If we apply change 2 and 3, still omitting 1, we get '(0 d).

That last change is tricky. We've deleted the 'a and prepended '(-1 0 1 a a1), and deleted the 'c and inserted '(b1 c d e f). If we omit change 4 and 5 (which delete the leading 0 and trailing 'd) we'll get '(-1 0 1 a a1 0 b b1 c d e f d). We preserve the order between different insertion points, so the inserted 0 is always before the inserted 'd, but we resolve ambiguous insertions by placing the later insertions before the earlier ones if they are in the same place.

Optional Exercise 7 (quite hard): Implement this as well.

Tuesday, August 13, 2013

Let's start with a list:
'(a b c d e f g h i j)

We'll add some elements:
'(a b c d e f g h i j k l m n o)

Delete some:
'(a b c d e i j k l m n o)

Add some more:
'(x y z a b c d e i j k l m n o)

Maybe delete one:
'(x y z b c d e i j k l m n o)

We want to compare the original sequence with the end sequence and
summarize the difference with a set of "indels", which specify the
indexes at where elements were inserted or deleted:

(diff '(a b c d e f g h i j) '(x y z b c d e i j k l m n o))

    :LEFT-SUBSEQ-START 0         ;; 'a'
    :RIGHT-SUBSEQ-START 0        ;; 'x y z'
    :LEFT-SUBSEQ-START 5         ;; 'f g h'
    :RIGHT-SUBSEQ-START 7        ;; nothing
    :LEFT-SUBSEQ-START 10        ;; nothing
    :RIGHT-SUBSEQ-START 9        ;; 'k l m n o'
Exercise 6 (hard): Implement this.

For an extreme challenge, implement this in a way that is not hopelessly inefficient.