Abstract Heresies: Collate / index-list

I was talking to Arthur Gleckler last night and he mentioned that he had been making good use of a function he called index-list. This function takes two selector functions and a list of objects. The first selector extracts a key from each object, and the second selector extracts a value. A table is returned that maps the keys to a list of all the values that were associated with that key.

I had to laugh. I had written the same function a few month back. I called it collate.

Here is Arthur’s version in Scheme:

(define (index-list elements table choose-data choose-key)
  "Given a hash table ‘table’, walk a list of ‘elements’ E, using
‘choose-key’ to extract the key K from each E and ‘choose-data’ to
extract a list of data D from each E.  Store each K in ‘table’ along
with a list of all the elements of all the D for that K."
  (do-list (e elements)
    (hash-table-update!/default
     table
     (choose-key e)
     (lambda (previous) (append (choose-data e) previous))
     ’()))
  table)

And here is my version in Common Lisp:

(defun collate (list &key (key #’car) (test #’eql)
                               (merger (merge-adjoin :test #’eql)) (default nil))
  (let ((table (make-hash-table :test test)))
    (dolist (element list table)
      (let ((key (funcall key element)))
        (setf (gethash key table)
              (funcall merger (gethash key table default) element))))))

So how do they differ?

Arthur’s version takes the hash table as a parameter. This allows the caller to control the hash table’s properties. My version creates a hash table using the test parameter, which defaults to eql.
Arthur’s version uses choose-key to extract the key from each element. My version uses key, which is a keyword parameter defaulting to car. My choice was driven by the convention of Common Lisp sequence functions to take a :key parameter.
Arthur’s version uses a default value of ’() for the entries in the hash table. My version uses the :default keyword argument that defaults to ’().
Arthur’s version uses choose-data to extract the datum in each element. My version uses the :merger keyword argument to specify how to merge the entire element into the table. If you only want a subfield of the element, you can compose a selector function with a merger function.
Arthur’s version uses append to collect the data associated with each element. My version uses a merger function to merge the element into the entry in the hash table. The default merger is merge-adjoin, which uses adjoin to add the element to the list of elements associated with the key. merge-adjoin is paramterized by a test function that defaults to eql. If the test is true, the new data is not merged, so the result of (merge-adjoin #’eql) is a list with no duplicates.
If you instead specify a default of 0 and a merger of (lambda (existing new) (+ existing 1)) you get a histogram.
Another merger I make use of is merge-unique, which ensures that all copies of the data being merged are the same, raising a warning if they are not.
Finally, I occasionally make use of a higher-order merger called merge-list that takes a list of mergers and applies them elementwise to two lists to be merged. This allows you to create a singleton aggregate merged element where the subfields are merged using different strategies.

Like Arthur, I found this to be a very useful function. I was auditing a data set obtained from GitHub. It came in as a flat list of records of users. Each record was a list of GitHub org, GitHub ID, and SAML/SSO login. Many of our users inadvertently have multiple GitHub IDs associated with their accounts. I used my collate function to create a table that mapped SAML/SSO login to a list of all the GitHub IDs associated with that login, and the list of orgs where that mapping applies.

Abstract Heresies

Tuesday, March 4, 2025

Collate / index-list

1 comment: