Friday, August 1, 2014

Mini regex golf

I was intrigued by Peter Norvig's articles about regex golf.

To make things easier to think about, I decided to start with the simpler problem of looking for substrings. Here's code to extract the ngrams of a string:
(define (ngrams-of-length n string)
  (do ((start    0 (1+ start))
       (end      n (1+ end))
       (answer '() (lset-adjoin string=? answer (substring string start end))))
      ((> end (string-length string)) answer)))

(define (ngrams string)
  (do ((n 1 (+ n 1))
       (answer '() (append (ngrams-of-length n string) answer)))
      ((> n (string-length string)) answer)))
A solution is simply a list of ngrams. (Although not every list of ngrams is a solution!)
(define (solution? solution winners losers)
  (let ((matches-solution? (ngram-list-matcher solution)))
    (and (for-all? winners matches-solution?)
         (not (there-exists? losers matches-solution?)))))

(define (ngram-list-matcher ngram-list)
  (lambda (test-string)
    (there-exists? ngram-list 
     (lambda (ngram)
       (string-search-forward ngram test-string)))))
We also want to know if an ngram appears in a given list of strings.
(define (string-list-matcher string-list)
  (lambda (test-ngram)
    (there-exists? string-list
     (lambda (string)
       (string-search-forward test-ngram string)))))

(fluid-let ((*unparser-list-breadth-limit* 10))
    (let ((matches-loser? (string-list-matcher losers)))
      (for-each
       (lambda (winner) (write-string winner) (write-string ": ") 
        (write (reverse (delete-matching-items (ngrams winner) matches-loser?)))
        (newline))
       winners)))

washington: ("sh" "hi" "gt" "to" "was" "ash" "shi" "hin" "ngt" "gto" ...)
adams: ("ad" "am" "ms" "ada" "dam" "ams" "adam" "dams" "adams")
jefferson: ("j" "je" "ef" "ff" "fe" "rs" "jef" "eff" "ffe" "fer" ...)
madison: ("ma" "ad" "di" "mad" "adi" "dis" "iso" "madi" "adis" "diso" ...)
monroe: ("oe" "onr" "nro" "roe" "monr" "onro" "nroe" "monro" "onroe" "monroe")
jackson: ("j" "ja" "ac" "ks" "jac" "ack" "cks" "kso" "jack" "acks" ...)
van-buren: ("-" "va" "n-" "-b" "bu" "van" "an-" "n-b" "-bu" "bur" ...)
harrison: ("har" "arr" "rri" "ris" "iso" "harr" "arri" "rris" "riso" "ison" ...)
polk: ("po" "pol" "olk" "polk")
taylor: ("ta" "yl" "lo" "tay" "ayl" "ylo" "lor" "tayl" "aylo" "ylor" ...)
pierce: ("rc" "ce" "pie" "ier" "erc" "rce" "pier" "ierc" "erce" "pierc" ...)
buchanan: ("bu" "uc" "ch" "na" "buc" "uch" "cha" "ana" "nan" "buch" ...)
lincoln: ("li" "ln" "lin" "col" "oln" "linc" "inco" "ncol" "coln" "linco" ...)
grant: ("ra" "gra" "ran" "ant" "gran" "rant" "grant")
hayes: ("ye" "hay" "aye" "yes" "haye" "ayes" "hayes")
garfield: ("ga" "rf" "fi" "gar" "arf" "rfi" "fie" "iel" "eld" "garf" ...)
cleveland: ("lev" "vel" "ela" "clev" "leve" "evel" "vela" "elan" "cleve" "level" ...)
mckinley: ("nl" "mck" "inl" "nle" "mcki" "kinl" "inle" "nley" "mckin" "ckinl" ...)
roosevelt: ("oo" "os" "lt" "roo" "oos" "ose" "sev" "vel" "elt" "roos" ...)
taft: ("ta" "af" "ft" "taf" "aft" "taft")
wilson: ("ls" "ils" "lso" "wils" "ilso" "lson" "wilso" "ilson" "wilson")
harding: ("di" "har" "ard" "rdi" "din" "hard" "ardi" "rdin" "ding" "hardi" ...)
coolidge: ("oo" "li" "coo" "ool" "oli" "lid" "cool" "ooli" "olid" "lidg" ...)
hoover: ("ho" "oo" "hoo" "oov" "hoov" "oove" "hoove" "oover" "hoover")
truman: ("tr" "ru" "ma" "tru" "rum" "uma" "man" "trum" "ruma" "uman" ...)
eisenhower: ("ei" "nh" "ho" "ow" "eis" "ise" "sen" "enh" "nho" "how" ...)
kennedy: ("nn" "ed" "dy" "ken" "enn" "nne" "ned" "edy" "kenn" "enne" ...)
johnson: ("j" "jo" "oh" "hn" "joh" "ohn" "hns" "john" "ohns" "hnso" ...)
nixon: ("ni" "ix" "xo" "nix" "ixo" "xon" "nixo" "ixon" "nixon")
carter: ("rt" "car" "art" "rte" "cart" "arte" "rter" "carte" "arter" "carter")
reagan: ("ea" "ag" "ga" "rea" "eag" "aga" "gan" "reag" "eaga" "agan" ...)
bush: ("bu" "us" "sh" "bus" "ush" "bush")
clinton: ("li" "to" "cli" "lin" "int" "nto" "ton" "clin" "lint" "into" ...)
obama: ("ob" "ba" "am" "ma" "oba" "bam" "ama" "obam" "bama" "obama")
We can discard ngrams like "shi" because the shorter ngram "sh" will also match.
(define (dominant-ngrams string losing-ngram?)
  (do ((n 1 (+ n 1))
       (answer '() (append
                     (delete-matching-items
                      (ngrams-of-length n string)
                      (lambda (item)
                        (or (there-exists? answer
                                           (lambda (ngram)
                                             (string-search-forward ngram item)))
                            (losing-ngram? item))))
                    answer)))
      ((> n (string-length string)) answer)))


(fluid-let ((*unparser-list-breadth-limit* 10))
    (let ((matches-loser? (string-list-matcher losers)))
      (for-each
       (lambda (winner) (write-string winner) (write-string ": ") 
        (write (dominant-ngrams winner matches-loser?))
        (newline))
       winners)))

washington: ("was" "to" "gt" "hi" "sh")
adams: ("ms" "am" "ad")
jefferson: ("rs" "fe" "ff" "ef" "j")
madison: ("iso" "di" "ad" "ma")
monroe: ("nro" "onr" "oe")
jackson: ("ks" "ac" "j")
van-buren: ("ren" "ure" "bu" "va" "-")
harrison: ("iso" "ris" "rri" "arr" "har")
polk: ("olk" "po")
taylor: ("lo" "yl" "ta")
pierce: ("ier" "pie" "ce" "rc")
buchanan: ("na" "ch" "uc" "bu")
lincoln: ("inco" "col" "ln" "li")
grant: ("ant" "ra")
hayes: ("hay" "ye")
garfield: ("eld" "iel" "fi" "rf" "ga")
cleveland: ("ela" "vel" "lev")
mckinley: ("mck" "nl")
roosevelt: ("vel" "sev" "lt" "os" "oo")
taft: ("ft" "af" "ta")
wilson: ("ls")
harding: ("ard" "har" "di")
coolidge: ("li" "oo")
hoover: ("oo" "ho")
truman: ("ma" "ru" "tr")
eisenhower: ("wer" "sen" "ise" "ow" "ho" "nh" "ei")
kennedy: ("ken" "dy" "ed" "nn")
johnson: ("hn" "oh" "j")
nixon: ("xo" "ix" "ni")
carter: ("car" "rt")
reagan: ("ga" "ag" "ea")
bush: ("sh" "us" "bu")
clinton: ("int" "to" "li")
obama: ("ma" "am" "ba" "ob")
It's time to tackle the set cover problem. We want a set of ngrams that match all the strings. Obviously, if we pick an ngram from each of the strings we want to cover, we'll have a solution. For instance,
(let ((matches-loser? (string-list-matcher losers)))
  (solution? (delete-duplicates
                 (map
                    (lambda (winner) (car (dominant-ngrams winner matches-loser?)))
                    winners))
                winners losers))
;Value: #t
We can cycle through all the possible solutions and then select the best one.
(define (mini-golf0 winners losers)
  (lowest-scoring
   (cover0 (make-dominant-ngram-table
            winners
            (delete-losing-superstrings winners losers)))))

(define (delete-losing-superstrings winners losers)
  (delete-matching-items
   losers
   (lambda (loser)
     (there-exists? winners
                    (lambda (winner)
                      (string-search-forward winner loser))))))

(define (make-dominant-ngram-table winners losers)
  (let ((losing-ngram? (string-list-matcher losers)))
    (map (lambda (winner)
           (cons winner (dominant-ngrams winner losing-ngram?)))
         winners)))

(define (cover0 v-k-table)
  (let ((empty-solution-set (list '())))
    (fold-left add-v-k-entry0 empty-solution-set v-k-table)))

(define (add-v-k-entry0 solution-set v-k-entry)
  (let ((value (car v-k-entry))
        (keys  (cdr v-k-entry)))

    (write-string "Adding value ") (write value) (newline)
    (write-string "   with keys ") (write keys) (newline)
    (write-string "   to ") (write (length solution-set))
    (write-string " partial solutions.") (newline)

    (let ((new-solutions
           (map make-new-solution (cartesian-product solution-set keys))))

      (write-string "Returning ") (write (length new-solutions))
      (write-string " new partial solutions.") (newline)

      new-solutions)))

(define (lowest-scoring list)
  (least-elements list (lambda (l r) (< (score l) (score r)))))

(define (cartesian-product left-list right-list)
  (fold-left (lambda (answer left)
               (fold-left (lambda (answer right)
                            (cons (cons left right) answer))
                          answer
                          right-list))
             '()
             left-list))

(define (make-new-solution cp-term)
  (let ((solution (car cp-term))
        (key (cdr cp-term)))
    (lset-adjoin equal? solution key)))

(define (improper-list-error procedure thing)
  (error (string-append "Improper list found by " procedure ": ") thing))

(define (least-elements list <)
  (define (accumulate-least answer item)
    (cond ((< (car answer) item) answer)
          ((< item (car answer)) (cons item '()))
          (else (cons item answer))))

  (cond ((pair? list) (fold-left accumulate-least
                                 (cons (car list) '())
                                 (cdr list)))
        ((null? list) (error "List must have at least one element." list))
        (else (improper-list-error 'LEAST-ELEMENTS list))))

(define (score solution)
  (do ((tail solution (cdr tail))
       (score -1      (+ score (string-length (car tail)) 1)))
      ((not (pair? tail))
       (if (null? tail)
           score
           (improper-list-error 'score solution)))))
This works for small sets:
1 ]=> (mini-golf0 boys girls)
Adding value "jacob"
   with keys ("ob" "c" "j")
   to 1 partial solutions.
Returning 3 new partial solutions.
Adding value "mason"
   with keys ("as")
   to 3 partial solutions.
Returning 3 new partial solutions.
Adding value "ethan"
   with keys ("an" "ha")
   to 3 partial solutions.
Returning 6 new partial solutions.
Adding value "noah"
   with keys ("ah" "oa" "no")
   to 6 partial solutions.
Returning 18 new partial solutions.
Adding value "william"
   with keys ("lia" "lli" "ill" "am" "w")
   to 18 partial solutions.
Returning 90 new partial solutions.
Adding value "liam"
   with keys ("lia" "am")
   to 90 partial solutions.
Returning 180 new partial solutions.
Adding value "jayden"
   with keys ("en" "de" "yd" "ay" "j")
   to 180 partial solutions.
Returning 900 new partial solutions.
Adding value "michael"
   with keys ("ae" "ha" "c")
   to 900 partial solutions.
Returning 2700 new partial solutions.
Adding value "alexander"
   with keys ("de" "nd" "an" "le" "al" "r" "x")
   to 2700 partial solutions.
Returning 18900 new partial solutions.
Adding value "aiden"
   with keys ("en" "de" "id")
   to 18900 partial solutions.
Returning 56700 new partial solutions.
;Value 41: (("de" "am" "ah" "ha" "as" "j")
            ("de" "am" "ah" "ha" "as" "j")
            ("de" "am" "oa" "ha" "as" "j")
            ("de" "am" "oa" "ha" "as" "j")
            ("de" "am" "no" "ha" "as" "j")
            ("de" "am" "no" "ha" "as" "j")
            ("de" "am" "ah" "ha" "as" "c")
            ("de" "am" "ah" "ha" "as" "c")
            ("de" "am" "oa" "ha" "as" "c")
            ("de" "am" "oa" "ha" "as" "c")
            ("de" "am" "no" "ha" "as" "c")
            ("de" "am" "no" "ha" "as" "c")
            ("de" "am" "ah" "an" "as" "c")
            ("de" "am" "ah" "an" "as" "c")
            ("en" "am" "ah" "an" "as" "c")
            ("de" "am" "oa" "an" "as" "c")
            ("de" "am" "oa" "an" "as" "c")
            ("en" "am" "oa" "an" "as" "c")
            ("de" "am" "no" "an" "as" "c")
            ("de" "am" "no" "an" "as" "c")
            ("en" "am" "no" "an" "as" "c"))
But you can see that we won't be able to go much beyond this because there are just too many combinations. We can cut down on the intermediate partial solutions by noticing that many of them are redundant. We don't need to keep partial solutions that cannot possibly lead to a shortest final solution. The various partial solutions each (potentially) match different sets of words. We only need keep the shortest solution for each different set of matched words. Furthermore, if a solution's matches are a superset of another's matches, and the other is the same length or longer, then the solution is dominated by the other and will always be at least the length of the longer.
(define (mini-golf1 winners losers)
  (cover1
   (make-dominant-ngram-table winners (delete-losing-superstrings winners losers))
   lowest-scoring))

(define (cover1 v-k-table lowest-scoring)
  (let ((empty-solution-set (list '())))

    (define (add-v-k-entry solution-set v-k-entry)
      (let ((value (car v-k-entry))
            (keys  (cdr v-k-entry)))

        (write-string "Adding value ") (write value) (newline)
        (write-string "   with keys ") (write keys) (newline)
        (write-string "   to ") (write (length solution-set))
        (write-string " partial solutions.") (newline)

        (let ((new-solutions
               (map make-new-solution (cartesian-product solution-set keys))))

          (let ((trimmed-solutions (trim-partial-solutions new-solutions)))

            (write-string "Returning ") (write (length trimmed-solutions))
            (write-string " of ") (write (length new-solutions))
            (write-string " new partial solutions.") (newline)

            trimmed-solutions))))

    (define (trim-partial-solutions partial-solutions)
      (let ((equivalent-solutions (collect-equivalent-partial-solutions partial-solutions)))
        (write-string "  Deleting ")
        (write (- (length partial-solutions) (length equivalent-solutions)))
        (write-string " equivalent partial solutions.")
        (newline)

        (remove-dominated-solutions
         (map lowest-scoring-equivalent-partial-solution equivalent-solutions))))

    (define (lowest-scoring-equivalent-partial-solution entry)
      (first (lowest-scoring (car entry))))

    (define (collect-equivalent-partial-solutions alist)
      ;; Add each entry in turn.
      (fold-left (lambda (equivalents partial-solution)
                   (add-equivalent-partial-solution
                    partial-solution
                    (partial-solution-matches partial-solution)
                    equivalents))
                 '() alist))

    (define (partial-solution-matches partial-solution)
      (keep-matching-items v-k-table
        (lambda (entry)
          (there-exists? partial-solution
                         (lambda (key) (member key (cdr entry)))))))

    (define (remove-dominated-solutions partial-solutions)
      (let ((before-length (length partial-solutions)))
        (let ((answer  (map car (fold-left (lambda (answer solution)
                                             (if (there-exists? answer (dominates-solution? solution))
                                                 answer
                                                 (cons solution answer)))
                                           '()
                                           (map (lambda (partial-solution)
                                                  (cons partial-solution (partial-solution-matches partial-solution)))
                                                partial-solutions)))))
          (let ((after-length (length answer)))
            (write-string "  Deleting ") (write (- before-length after-length))
            (write-string " dominated solutions.")
            (newline)
            answer))))

    (lowest-scoring
     (fold-left add-v-k-entry empty-solution-set v-k-table))))

(define (dominates-solution? solution)
  (let ((partial-solution (car solution))
        (solution-matches (cdr solution)))
    (lambda (other-solution)
      (let ((other-partial-solution (car other-solution))
            (other-matches (cdr other-solution)))
        (and (not (equal? solution-matches other-matches))
             (superset? other-matches solution-matches)
             (<= (score other-partial-solution) (score partial-solution)))))))

(define (add-equivalent-partial-solution solution value alist)
  (cond ((pair? alist)
         (let ((entry (car alist))
               (tail (cdr alist)))
           (let ((entry-solutions (car entry))
                 (entry-value (cdr entry)))
             (if (equal? value entry-value)
                 (if (member solution entry-solutions)
                     alist
                     (cons (cons (cons solution entry-solutions) value)
                           tail))
                 (cons entry (add-equivalent-partial-solution solution value tail))))))
        ((null? alist) (list (cons (list solution) value)))
        (else (improper-list-error 'collect-equivalents alist))))
1 ]=> (mini-golf1 winners losers)
Adding value "washington"
   with keys ("was" "to" "gt" "hi" "sh")
   to 1 partial solutions.
  Deleting 2 equivalent partial solutions.
  Removing 1 dominated solutions.
Returning 2 of 5 new partial solutions.
Adding value "adams"
   with keys ("ms" "am" "ad")
   to 2 partial solutions.
  Deleting 0 equivalent partial solutions.
  Removing 2 dominated solutions.
Returning 4 of 6 new partial solutions.
Adding value "jefferson"
   with keys ("rs" "fe" "ff" "ef" "j")
   to 4 partial solutions.
  Deleting 12 equivalent partial solutions.
  Removing 4 dominated solutions.
Returning 4 of 20 new partial solutions.
Adding value "madison"
   with keys ("iso" "di" "ad" "ma")
   to 4 partial solutions.
  Deleting 2 equivalent partial solutions.
  Removing 2 dominated solutions.
Returning 12 of 16 new partial solutions.
Adding value "monroe"
   with keys ("nro" "onr" "oe")
   to 12 partial solutions.
  Deleting 24 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 12 of 36 new partial solutions.
Adding value "jackson"
   with keys ("ks" "ac" "j")
   to 12 partial solutions.
  Deleting 24 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 12 of 36 new partial solutions.
Adding value "van-buren"
   with keys ("ren" "ure" "bu" "va" "-")
   to 12 partial solutions.
  Deleting 36 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 24 of 60 new partial solutions.
Adding value "harrison"
   with keys ("iso" "ris" "rri" "arr" "har")
   to 24 partial solutions.
  Deleting 96 equivalent partial solutions.
  Removing 12 dominated solutions.
Returning 12 of 120 new partial solutions.
Adding value "polk"
   with keys ("olk" "po")
   to 12 partial solutions.
  Deleting 12 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 12 of 24 new partial solutions.
Adding value "taylor"
   with keys ("lo" "yl" "ta")
   to 12 partial solutions.
  Deleting 12 equivalent partial solutions.
  Removing 12 dominated solutions.
Returning 12 of 36 new partial solutions.
Adding value "pierce"
   with keys ("ier" "pie" "ce" "rc")
   to 12 partial solutions.
  Deleting 36 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 12 of 48 new partial solutions.
Adding value "buchanan"
   with keys ("na" "ch" "uc" "bu")
   to 12 partial solutions.
  Deleting 39 equivalent partial solutions.
  Removing 3 dominated solutions.
Returning 6 of 48 new partial solutions.
Adding value "lincoln"
   with keys ("inco" "col" "ln" "li")
   to 6 partial solutions.
  Deleting 15 equivalent partial solutions.
  Removing 6 dominated solutions.
Returning 3 of 24 new partial solutions.
Adding value "grant"
   with keys ("ant" "ra")
   to 3 partial solutions.
  Deleting 3 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 3 of 6 new partial solutions.
Adding value "hayes"
   with keys ("hay" "ye")
   to 3 partial solutions.
  Deleting 3 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 3 of 6 new partial solutions.
Adding value "garfield"
   with keys ("eld" "iel" "fi" "rf" "ga")
   to 3 partial solutions.
  Deleting 9 equivalent partial solutions.
  Removing 3 dominated solutions.
Returning 3 of 15 new partial solutions.
Adding value "cleveland"
   with keys ("ela" "vel" "lev")
   to 3 partial solutions.
  Deleting 3 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 6 of 9 new partial solutions.
Adding value "mckinley"
   with keys ("mck" "nl")
   to 6 partial solutions.
  Deleting 6 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 6 of 12 new partial solutions.
Adding value "roosevelt"
   with keys ("vel" "sev" "lt" "os" "oo")
   to 6 partial solutions.
  Deleting 24 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 6 of 30 new partial solutions.
Adding value "taft"
   with keys ("ft" "af" "ta")
   to 6 partial solutions.
  Deleting 12 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 6 of 18 new partial solutions.
Adding value "wilson"
   with keys ("ls")
   to 6 partial solutions.
  Deleting 0 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 6 of 6 new partial solutions.
Adding value "harding"
   with keys ("ard" "har" "di")
   to 6 partial solutions.
  Deleting 12 equivalent partial solutions.
  Removing 2 dominated solutions.
Returning 4 of 18 new partial solutions.
Adding value "coolidge"
   with keys ("li" "oo")
   to 4 partial solutions.
  Deleting 4 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 4 of 8 new partial solutions.
Adding value "hoover"
   with keys ("oo" "ho")
   to 4 partial solutions.
  Deleting 4 equivalent partial solutions.
  Removing 2 dominated solutions.
Returning 2 of 8 new partial solutions.
Adding value "truman"
   with keys ("ma" "ru" "tr")
   to 2 partial solutions.
  Deleting 4 equivalent partial solutions.
  Removing 1 dominated solutions.
Returning 1 of 6 new partial solutions.
Adding value "eisenhower"
   with keys ("wer" "sen" "ise" "ow" "ho" "nh" "ei")
   to 1 partial solutions.
  Deleting 6 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 1 of 7 new partial solutions.
Adding value "kennedy"
   with keys ("ken" "dy" "ed" "nn")
   to 1 partial solutions.
  Deleting 3 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 1 of 4 new partial solutions.
Adding value "johnson"
   with keys ("hn" "oh" "j")
   to 1 partial solutions.
  Deleting 2 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 1 of 3 new partial solutions.
Adding value "nixon"
   with keys ("xo" "ix" "ni")
   to 1 partial solutions.
  Deleting 2 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 1 of 3 new partial solutions.
Adding value "carter"
   with keys ("car" "rt")
   to 1 partial solutions.
  Deleting 1 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 1 of 2 new partial solutions.
Adding value "reagan"
   with keys ("ga" "ag" "ea")
   to 1 partial solutions.
  Deleting 2 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 1 of 3 new partial solutions.
Adding value "bush"
   with keys ("sh" "us" "bu")
   to 1 partial solutions.
  Deleting 2 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 1 of 3 new partial solutions.
Adding value "clinton"
   with keys ("int" "to" "li")
   to 1 partial solutions.
  Deleting 2 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 1 of 3 new partial solutions.
Adding value "obama"
   with keys ("ma" "am" "ba" "ob")
   to 1 partial solutions.
  Deleting 3 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 1 of 4 new partial solutions.
;Value 47: (("rt" "ni" "nn" "ho" "ls" "nl" "vel" "ga" "ye" "ra" "li" "rc" "ta" "po" "har" "bu" "oe" "ma" "j" "ad" "sh"))
The cover procedure takes a table that maps values to the keys that cover them. If we can reduce the size of that table without changing the solution, we'll run faster. If there are two entries in the table such that the keys of one are a superset of the keys of the other, we can discard the superset: the smaller of the two entries will be in the solution, and any key that matches the smaller one will automatically match the larger one as well. Also, if two values have the same set of keys that match them, we need only include one of the values in the table.
(define (delete-dominated-values v-k-table)
  (let ((size-before (length v-k-table)))

    (define (dominated-value? entry)
      (let ((entry-value (car entry))
            (entry-keylist (cdr entry)))
        (there-exists? v-k-table
          (lambda (other-entry)
            (and (not (eq? entry other-entry))
                 (let ((other-value (car other-entry))
                       (other-keylist (cdr other-entry)))
                   (and (superset? entry-keylist other-keylist)
                        (not (equal? other-keylist entry-keylist)))))))))

    (define (equivalent-value-in-answer? answer entry)
      (let ((entry-value (car entry))
            (entry-keylist (cdr entry)))
        (there-exists? answer
          (lambda (other-entry)
            (let ((other-value (car other-entry))
                  (other-keylist (cdr other-entry)))
              (equal? entry-keylist other-keylist))))))

    (define (add-entry answer entry)
      (if (or (equivalent-value-in-answer? answer entry)
              (dominated-value? entry))
          answer
          (cons entry answer)))

    (let ((answer (fold-left add-entry '() v-k-table)))
      (write-string "Removed ") (write (- size-before (length answer)))
      (write-string " dominated and equivalent values.")
      (newline)
      answer)))

(define (superset? bigger smaller)
  (for-all? smaller (lambda (s) (member s bigger))))

(define (mini-golf2 winners losers)
  (cover1
   (delete-dominated-values
    (make-dominant-ngram-table winners (delete-losing-superstrings winners losers)))
   lowest-scoring))

;;;;;;;;
;; Delete dominated keys from the keylists.

(define (mini-golf3 winners losers)
  (cover1
   (delete-dominated-keys-and-values
    (make-dominant-ngram-table winners (delete-losing-superstrings winners losers))
    (lambda (left right)
      (or (< (string-length left) (string-length right))
          (and (= (string-length left) (string-length right))
               (string<? left right)))))
   lowest-scoring))

(define (delete-dominated-keys-and-values v-k-table better-key)
  (let ((before-size (fold-left * 1 (map length v-k-table))))
    (let ((new-table (delete-dominated-values
                      (delete-dominated-keys v-k-table better-key))))
      (let ((after-size (fold-left * 1 (map length new-table))))
        (if (= before-size after-size)
            v-k-table
            (delete-dominated-keys-and-values new-table better-key))))))

(define (delete-dominated-keys v-k-table better-key)
  (let ((all-keys (get-all-keys v-k-table)))

    (define (lookup-key key)
      (cons key
            (map car
                 (keep-matching-items v-k-table
                                      (lambda (v-k-entry)
                                        (member key (cdr v-k-entry)))))))

    (let ((k-v-table (map lookup-key all-keys)))

      (define (dominated-key? key)
        (let ((values (cdr (assoc key k-v-table))))
          (there-exists? k-v-table
                         (lambda (entry)
                           (let ((entry-key (car entry))
                                 (entry-values (cdr entry)))
                             (and (superset? entry-values values)
                                  (not (equal? values entry-values))
                                  (or (< (string-length entry-key) (string-length key))
                                      (and (= (string-length entry-key) (string-length key))
                                           (string<? entry-key key)))))))))

      (define (equivalent-key-in-answer? answer key)
        (let ((values (cdr (assoc key k-v-table))))
          (there-exists? answer
                         (lambda (entry-key)
                           (let ((entry-values (cdr (lookup-key entry-key))))
                             (equal? values entry-values))))))

      (define (add-keys answer key)
        (if (or (dominated-key? key)
                (equivalent-key-in-answer? answer key))
            answer
            (cons key answer)))

      (let ((good-keys (fold-left add-keys '() (sort all-keys better-key))))
        (write-string "Removed ") (write (- (length all-keys) (length good-keys)))
        (write-string " of ") (write (length all-keys)) (write-string " keys.")(newline)

        (map (lambda (entry)
               (cons (car entry)
                     (keep-matching-items (cdr entry) (lambda (key) (member key good-keys)))))
             v-k-table)))))

(define (get-all-keys v-k-table)
  (fold-left (lambda (answer entry)
               (fold-left (lambda (answer key)
                            (lset-adjoin equal? answer key))
                          answer
                          (cdr entry)))
             '()
             v-k-table))
Trimming the table this way helps a lot. We can now compute the dogs vs. cats.
1 ]=> (mini-golf3 dogs cats)

Removed 294 of 405 keys.
Removed 44 dominated and equivalent values.
Removed 25 of 93 keys.
Removed 15 dominated and equivalent values.
Removed 7 of 62 keys.
Removed 0 dominated and equivalent values.
Removed 0 of 55 keys.
Removed 0 dominated and equivalent values.
Adding value "BORZOIS"
   with keys ("OIS" "BOR" "RZ")
   to 1 partial solutions.
  Deleting 0 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 3 of 3 new partial solutions.
Adding value "GIANT SCHNAUZERS"
   with keys ("SCH" "HN")
   to 3 partial solutions.
  Deleting 0 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 6 of 6 new partial solutions.
Adding value "BASENJIS"
   with keys ("JI")
   to 6 partial solutions.
  Deleting 0 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 6 of 6 new partial solutions.
Adding value "ENGLISH SETTERS"
   with keys ("TERS" "ETT")
   to 6 partial solutions.
  Deleting 0 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 12 of 12 new partial solutions.
Adding value "JAPANESE CHIN"
   with keys ("CHI")
   to 12 partial solutions.
  Deleting 0 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 12 of 12 new partial solutions.
Adding value "BOUVIERS DES FLANDRES"
   with keys ("S F" "DES" " DE" "IER" "FL" "VI")
   to 12 partial solutions.
  Deleting 8 equivalent partial solutions.
  Removing 8 dominated solutions.
Returning 56 of 72 new partial solutions.
Adding value "PEKINGESE"
   with keys ("EKI")
   to 56 partial solutions.
  Deleting 0 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 56 of 56 new partial solutions.
Adding value "BELGIAN MALINOIS"
   with keys (" MAL" "OIS" "LG")
   to 56 partial solutions.
  Deleting 96 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 72 of 168 new partial solutions.
Adding value "GERMAN WIREHAIRED POINTERS"
   with keys ("TERS" "D P")
   to 72 partial solutions.
  Deleting 108 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 36 of 144 new partial solutions.
Adding value "CHOW CHOWS"
   with keys ("W ")
   to 36 partial solutions.
  Deleting 0 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 36 of 36 new partial solutions.
Adding value "SAMOYEDS"
   with keys ("DS")
   to 36 partial solutions.
  Deleting 0 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 36 of 36 new partial solutions.
Adding value "DOGUES DE BORDEAUX"
   with keys ("BOR" " DE" "GU")
   to 36 partial solutions.
  Deleting 88 equivalent partial solutions.
  Removing 2 dominated solutions.
Returning 18 of 108 new partial solutions.
Adding value "DALMATIANS"
   with keys ("ANS" "LM")
   to 18 partial solutions.
  Deleting 0 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 36 of 36 new partial solutions.
Adding value "LHASA APSOS"
   with keys ("LH")
   to 36 partial solutions.
  Deleting 0 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 36 of 36 new partial solutions.
Adding value "CANE CORSO"
   with keys (" COR" "ORS")
   to 36 partial solutions.
  Deleting 0 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 72 of 72 new partial solutions.
Adding value "ALASKAN MALAMUTES"
   with keys (" MAL" "TES" "LAS" "KA")
   to 72 partial solutions.
  Deleting 184 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 104 of 288 new partial solutions.
Adding value "WHIPPETS"
   with keys ("IP")
   to 104 partial solutions.
  Deleting 0 equivalent partial solutions.
;GC #199: took:   0.20   (1%) CPU time,   0.10   (1%) real time; free: 16754359
  Removing 0 dominated solutions.
Returning 104 of 104 new partial solutions.
Adding value "SHIBA INU"
   with keys ("SHI" " I")
   to 104 partial solutions.
  Deleting 0 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 208 of 208 new partial solutions.
Adding value "AKITAS"
   with keys ("AK")
   to 208 partial solutions.
  Deleting 0 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 208 of 208 new partial solutions.
Adding value "RHODESIAN RIDGEBACKS"
   with keys ("DES" "DG" "OD")
   to 208 partial solutions.
  Deleting 304 equivalent partial solutions.
  Removing 144 dominated solutions.
Returning 176 of 624 new partial solutions.
Adding value "BICHONS FRISES"
   with keys ("S F" "FR")
   to 176 partial solutions.
  Deleting 224 equivalent partial solutions.
  Removing 16 dominated solutions.
Returning 112 of 352 new partial solutions.
Adding value "PAPILLONS"
   with keys ("API")
   to 112 partial solutions.
  Deleting 0 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 112 of 112 new partial solutions.
Adding value "COLLIES"
   with keys ("IES")
   to 112 partial solutions.
  Deleting 0 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 112 of 112 new partial solutions.
Adding value "VIZSLAS"
   with keys ("LAS" "IZ" "VI")
   to 112 partial solutions.
;GC #200: took:   0.10   (0%) CPU time,   0.10   (1%) real time; free: 16757322
  Deleting 272 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 64 of 336 new partial solutions.
Adding value "BRITTANYS"
   with keys ("ITT")
   to 64 partial solutions.
  Deleting 0 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 64 of 64 new partial solutions.
Adding value "PUGS"
   with keys ("GS")
   to 64 partial solutions.
  Deleting 0 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 64 of 64 new partial solutions.
Adding value "HAVANESE"
   with keys ("HAVANE")
   to 64 partial solutions.
  Deleting 0 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 64 of 64 new partial solutions.
Adding value "COCKER SPANIELS"
   with keys ("ANI" "LS")
   to 64 partial solutions.
  Deleting 80 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 48 of 128 new partial solutions.
Adding value "MASTIFFS"
   with keys ("FS")
   to 48 partial solutions.
  Deleting 0 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 48 of 48 new partial solutions.
Adding value "MALTESE"
   with keys ("TES" "LT")
   to 48 partial solutions.
  Deleting 72 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 24 of 96 new partial solutions.
Adding value "PEMBROKE WELSH CORGIS"
   with keys (" COR" "LS")
   to 24 partial solutions.
  Deleting 32 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 16 of 48 new partial solutions.
Adding value "BOSTON TERRIERS"
   with keys ("IER" " T")
   to 16 partial solutions.
  Deleting 24 equivalent partial solutions.
  Removing 4 dominated solutions.
Returning 4 of 32 new partial solutions.
Adding value "POMERANIANS"
   with keys ("ANS" "ANI")
   to 4 partial solutions.
  Deleting 6 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 2 of 8 new partial solutions.
Adding value "GREAT DANES"
   with keys ("GR")
   to 2 partial solutions.
  Deleting 0 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 2 of 2 new partial solutions.
Adding value "DOBERMAN PINSCHERS"
   with keys ("SCH" " PI")
   to 2 partial solutions.
  Deleting 3 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 1 of 4 new partial solutions.
Adding value "SHIH TZU"
   with keys ("SHI" " T")
   to 1 partial solutions.
  Deleting 1 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 1 of 2 new partial solutions.
Adding value "ROTTWEILERS"
   with keys ("EI")
   to 1 partial solutions.
  Deleting 0 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 1 of 1 new partial solutions.
Adding value "POODLES"
   with keys ("DL" "OD")
   to 1 partial solutions.
  Deleting 1 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 1 of 2 new partial solutions.
Adding value "BOXERS"
   with keys ("OX")
   to 1 partial solutions.
  Deleting 0 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 1 of 1 new partial solutions.
Adding value "BEAGLES"
   with keys ("AGL")
   to 1 partial solutions.
  Deleting 0 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 1 of 1 new partial solutions.
Adding value "LABRADOR RETRIEVERS"
   with keys ("VE")
   to 1 partial solutions.
  Deleting 0 equivalent partial solutions.
  Removing 0 dominated solutions.
Returning 1 of 1 new partial solutions.
;Value 50: (("VE" "AGL" "OX" "EI" "GR" " T" "FS" "LS" "HAVANE" "GS" "ITT" "IES" "API" "FR" "OD" "AK" " I" "IP" "TES" "ORS" "LH" "ANS" "GU" "DS" "W " "EKI" "VI" "CHI" "TERS" "JI" "SCH" "OIS"))
We appear to have the substring version of regex golf under control. Can we extend it to actual regular expressions? Of course we can. In the next installment...

No comments: