I was talking to Arthur Gleckler last night and he mentioned that
he had been making good use of a function he
called index-list
. This function takes two selector
functions and a list of objects. The first selector extracts a key
from each object, and the second selector extracts a value. A table
is returned that maps the keys to a list of all the values that were
associated with that key.
I had to laugh. I had written the same function a few month back.
I called it collate
.
Here is Arthur’s version in Scheme:
(define (index-list elements table choose-data choose-key) "Given a hash table ‘table’, walk a list of ‘elements’ E, using ‘choose-key’ to extract the key K from each E and ‘choose-data’ to extract a list of data D from each E. Store each K in ‘table’ along with a list of all the elements of all the D for that K." (do-list (e elements) (hash-table-update!/default table (choose-key e) (lambda (previous) (append (choose-data e) previous)) ’())) table)
And here is my version in Common Lisp:
(defun collate (list &key (key #’car) (test #’eql) (merger (merge-adjoin :test #’eql)) (default nil)) (let ((table (make-hash-table :test test))) (dolist (element list table) (let ((key (funcall key element))) (setf (gethash key table) (funcall merger (gethash key table default) element))))))
So how do they differ?
- Arthur’s version takes the hash table as a parameter. This
allows the caller to control the hash table’s properties. My
version creates a hash table using the
test
parameter, which defaults toeql
. - Arthur’s version uses
choose-key
to extract the key from each element. My version useskey
, which is a keyword parameter defaulting tocar
. My choice was driven by the convention of Common Lisp sequence functions to take a:key
parameter. - Arthur’s version uses a default value of
’()
for the entries in the hash table. My version uses the:default
keyword argument that defaults to ’(). - Arthur’s version uses
choose-data
to extract the datum in each element. My version uses the:merger
keyword argument to specify how to merge the entire element into the table. If you only want a subfield of the element, you cancompose
a selector function with a merger function. - Arthur’s version uses
append
to collect the data associated with each element. My version uses a merger function to merge the element into the entry in the hash table. The default merger ismerge-adjoin
, which usesadjoin
to add the element to the list of elements associated with the key.merge-adjoin
is paramterized by a test function that defaults toeql
. If the test is true, the new data is not merged, so the result of(merge-adjoin #’eql)
is a list with no duplicates. - If you instead specify a default of 0 and a merger of
(lambda (existing new) (+ existing 1))
you get a histogram. - Another merger I make use of is
merge-unique
, which ensures that all copies of the data being merged are the same, raising a warning if they are not. - Finally, I occasionally make use of a higher-order merger called
merge-list
that takes a list of mergers and applies them elementwise to two lists to be merged. This allows you to create a singleton aggregate merged element where the subfields are merged using different strategies.
Like Arthur, I found this to be a very useful function. I was
auditing a data set obtained from GitHub. It came in as a flat list
of records of users. Each record was a list of GitHub org, GitHub
ID, and SAML/SSO login. Many of our users inadvertently have
multiple GitHub IDs associated with their accounts. I used
my collate
function to create a table that mapped
SAML/SSO login to a list of all the GitHub IDs associated with that
login, and the list of orgs where that mapping applies.
1 comment:
It was fun to find out that we had come up with such similar code. I use that procedure all over the place.
Post a Comment