I was talking to Arthur Gleckler last night and he mentioned that
he had been making good use of a function he
called index-list
. This function takes two selector
functions and a list of objects. The first selector extracts a key
from each object, and the second selector extracts a value. A table
is returned that maps the keys to a list of all the values that were
associated with that key.
I had to laugh. I had written the same function a few month back.
I called it collate
.
Here is Arthur’s version in Scheme:
(define (index-list elements table choose-data choose-key) "Given a hash table ‘table’, walk a list of ‘elements’ E, using ‘choose-key’ to extract the key K from each E and ‘choose-data’ to extract a list of data D from each E. Store each K in ‘table’ along with a list of all the elements of all the D for that K." (do-list (e elements) (hash-table-update!/default table (choose-key e) (lambda (previous) (append (choose-data e) previous)) ’())) table)
And here is my version in Common Lisp:
(defun collate (list &key (key #’car) (test #’eql) (merger (merge-adjoin :test #’eql)) (default nil)) (let ((table (make-hash-table :test test))) (dolist (element list table) (let ((key (funcall key element))) (setf (gethash key table) (funcall merger (gethash key table default) element))))))
So how do they differ?
- Arthur’s version takes the hash table as a parameter. This
allows the caller to control the hash table’s properties. My
version creates a hash table using the
test
parameter, which defaults toeql
. - Arthur’s version uses
choose-key
to extract the key from each element. My version useskey
, which is a keyword parameter defaulting tocar
. My choice was driven by the convention of Common Lisp sequence functions to take a:key
parameter. - Arthur’s version uses a default value of
’()
for the entries in the hash table. My version uses the:default
keyword argument that defaults to ’(). - Arthur’s version uses
choose-data
to extract the datum in each element. My version uses the:merger
keyword argument to specify how to merge the entire element into the table. If you only want a subfield of the element, you cancompose
a selector function with a merger function. - Arthur’s version uses
append
to collect the data associated with each element. My version uses a merger function to merge the element into the entry in the hash table. The default merger ismerge-adjoin
, which usesadjoin
to add the element to the list of elements associated with the key.merge-adjoin
is paramterized by a test function that defaults toeql
. If the test is true, the new data is not merged, so the result of(merge-adjoin #’eql)
is a list with no duplicates. - If you instead specify a default of 0 and a merger of
(lambda (existing new) (+ existing 1))
you get a histogram. - Another merger I make use of is
merge-unique
, which ensures that all copies of the data being merged are the same, raising a warning if they are not. - Finally, I occasionally make use of a higher-order merger called
merge-list
that takes a list of mergers and applies them elementwise to two lists to be merged. This allows you to create a singleton aggregate merged element where the subfields are merged using different strategies.
Like Arthur, I found this to be a very useful function. I was
auditing a data set obtained from GitHub. It came in as a flat list
of records of users. Each record was a list of GitHub org, GitHub
ID, and SAML/SSO login. Many of our users inadvertently have
multiple GitHub IDs associated with their accounts. I used
my collate
function to create a table that mapped
SAML/SSO login to a list of all the GitHub IDs associated with that
login, and the list of orgs where that mapping applies.
It was fun to find out that we had come up with such similar code. I use that procedure all over the place.
ReplyDelete