Tuesday, March 4, 2025

Collate / index-list

I was talking to Arthur Gleckler last night and he mentioned that he had been making good use of a function he called index-list. This function takes two selector functions and a list of objects. The first selector extracts a key from each object, and the second selector extracts a value. A table is returned that maps the keys to a list of all the values that were associated with that key.

I had to laugh. I had written the same function a few month back. I called it collate.

Here is Arthur’s version in Scheme:

(define (index-list elements table choose-data choose-key)
  "Given a hash table ‘table’, walk a list of ‘elements’ E, using
‘choose-key’ to extract the key K from each E and ‘choose-data’ to
extract a list of data D from each E.  Store each K in ‘table’ along
with a list of all the elements of all the D for that K."
  (do-list (e elements)
    (hash-table-update!/default
     table
     (choose-key e)
     (lambda (previous) (append (choose-data e) previous))
     ’()))
  table)

And here is my version in Common Lisp:

(defun collate (list &key (key #’car) (test #’eql)
                               (merger (merge-adjoin :test #’eql)) (default nil))
  (let ((table (make-hash-table :test test)))
    (dolist (element list table)
      (let ((key (funcall key element)))
        (setf (gethash key table)
              (funcall merger (gethash key table default) element))))))

So how do they differ?

  • Arthur’s version takes the hash table as a parameter. This allows the caller to control the hash table’s properties. My version creates a hash table using the test parameter, which defaults to eql.
  • Arthur’s version uses choose-key to extract the key from each element. My version uses key, which is a keyword parameter defaulting to car. My choice was driven by the convention of Common Lisp sequence functions to take a :key parameter.
  • Arthur’s version uses a default value of ’() for the entries in the hash table. My version uses the :default keyword argument that defaults to ’().
  • Arthur’s version uses choose-data to extract the datum in each element. My version uses the :merger keyword argument to specify how to merge the entire element into the table. If you only want a subfield of the element, you can compose a selector function with a merger function.
  • Arthur’s version uses append to collect the data associated with each element. My version uses a merger function to merge the element into the entry in the hash table. The default merger is merge-adjoin, which uses adjoin to add the element to the list of elements associated with the key. merge-adjoin is paramterized by a test function that defaults to eql. If the test is true, the new data is not merged, so the result of (merge-adjoin #’eql) is a list with no duplicates.
  • If you instead specify a default of 0 and a merger of (lambda (existing new) (+ existing 1)) you get a histogram.
  • Another merger I make use of is merge-unique, which ensures that all copies of the data being merged are the same, raising a warning if they are not.
  • Finally, I occasionally make use of a higher-order merger called merge-list that takes a list of mergers and applies them elementwise to two lists to be merged. This allows you to create a singleton aggregate merged element where the subfields are merged using different strategies.

Like Arthur, I found this to be a very useful function. I was auditing a data set obtained from GitHub. It came in as a flat list of records of users. Each record was a list of GitHub org, GitHub ID, and SAML/SSO login. Many of our users inadvertently have multiple GitHub IDs associated with their accounts. I used my collate function to create a table that mapped SAML/SSO login to a list of all the GitHub IDs associated with that login, and the list of orgs where that mapping applies.

1 comment:

Arthur A. Gleckler said...

It was fun to find out that we had come up with such similar code. I use that procedure all over the place.