Expanding pseudocode to Common Lisp poses some interesting challenges. Here are a few of them:
Quoted code
The LLM tends to generate markdown. In particular, it likes to place “code fences” around code blocks: triple backticks (```) followed by the language name, then the code, then another set of triple backticks. This is a common way to format code in markdown.
Preferably, the generated code should be a standalone s-expression that we can simply read. I put explicit instructions in the system instructions not to use code fences, but the LLM can be pretty insistent about this. Eventually I gave up and wrote some code to inspect the generated output and strip the fences.
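The fence-stripping step can be sketched roughly like this (the function name and the exact trimming behavior are mine; the post doesn't show the actual code):

```lisp
(defun strip-code-fences (text)
  "Remove a leading ```lang line and a trailing ``` line, if present.
Returns TEXT (trimmed) unchanged when it is not fenced."
  (let ((trimmed (string-trim '(#\Space #\Tab #\Newline #\Return) text)))
    (if (and (>= (length trimmed) 6)
             (string= "```" trimmed :end2 3)
             (string= "```" trimmed :start2 (- (length trimmed) 3)))
        ;; Drop the first line (``` plus the language name) and the
        ;; trailing fence, leaving just the code between them.
        (let ((newline (position #\Newline trimmed)))
          (if newline
              (string-trim '(#\Space #\Tab #\Newline #\Return)
                           (subseq trimmed (1+ newline) (- (length trimmed) 3)))
              trimmed))
        trimmed)))
```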
The LLM has a tendency to generate code that is quoted. For example, when told "add a and b", it will generate `(+ a b), which is a list of three symbols. This is usually not what we want (but it is a valid thing to want at times). It is only the top-level expression that is problematic — the LLM doesn't generate spurious internal quotations. I was unable to persuade the LLM to reliably not quote the top-level expression, so I wrote some code that inspects the generated code and removes an initial backquote. This is a bit of a hack because there are legitimate cases where a leading backquote is in fact correct, but more often than not it is an unwanted artifact. If the LLM truly “wanted” to generate a template, it could use explicit list-building operations like LIST, CONS, and APPEND at top level.
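The backquote-stripping hack is equally small. A sketch (again, the name and details are mine), operating on the raw text before it is read:

```lisp
(defun strip-leading-backquote (text)
  "Remove a single leading backquote from TEXT, if there is one.
This loses the rare legitimate top-level template, but gets rid of
the far more common unwanted quoting artifact."
  (let ((trimmed (string-left-trim '(#\Space #\Tab #\Newline #\Return) text)))
    (if (and (plusp (length trimmed))
             (char= (char trimmed 0) #\`))
        (subseq trimmed 1)
        trimmed)))
```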
Known Functions
If you just ask the LLM to generate code, it will often make up function names that do not exist, or refer to libraries that are not loaded. To solve this problem, I provide a list of functions and tell the LLM that the generated code may only use those functions. We could simply hand the LLM a list of all symbols that are fbound, but that would be a very long list, and it would include functions that were never meant to be called from outside the package in which they are defined. Instead, I provide two lists: first, the fbound symbols that are visible in the current package — either present directly in the package or external in packages used by the current package; second, the symbols that are fbound and external in any package. Symbols in the first list can be referred to without package qualifiers, while symbols in the second list must have a package qualifier. The LLM is instructed to prefer symbols from the first list, but it is allowed to use symbols from the second; that is, it prefers symbols that don't require a package qualifier, but it can borrow symbols that other packages export if it needs to. I provide analogous lists of bound symbols (global variables). This works pretty well — not perfectly, but well enough. It does require that the libraries be loaded prior to any pseudocode expansion so that we can find their external symbols.
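Collecting the two symbol lists is straightforward with the standard package-iteration macros. A sketch (the function names are mine; note that DO-SYMBOLS may visit an inherited symbol more than once, hence PUSHNEW):

```lisp
(defun visible-function-symbols (&optional (package *package*))
  "Fbound symbols accessible in PACKAGE without a package qualifier."
  (let ((result '()))
    (do-symbols (sym package)
      (when (fboundp sym)
        (pushnew sym result)))
    (sort result #'string< :key #'symbol-name)))

(defun external-function-symbols ()
  "Fbound symbols that are external in any package; these generally
need a package qualifier when used from elsewhere."
  (let ((result '()))
    (dolist (pkg (list-all-packages))
      (do-external-symbols (sym pkg)
        (when (fboundp sym)
          (pushnew sym result))))
    (sort result #'string< :key #'symbol-name)))
```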
But we have the problem that the code we are expanding is not yet loaded. The LLM won't know about the functions defined in the current file until we load it. This is a serious problem. The solution is to provide the LLM with the source code of the file being compiled. This is a bit of a hack, but it exposes the names being defined to the LLM so it can generate code that refers to other definitions in the same file.
Naive Recursion
Suppose the LLM is told to define a function FOO that subtracts two numbers. It looks in the source code and discovers a definition for a function called FOO, i.e. the very fragment of source code we are expanding. It sees that this function, when defined, is supposed to subtract two numbers. This is precisely what we want to do, so the LLM reasons that we can simply call this function. Thus the body of FOO becomes a call to the function FOO. Obviously this kind of naive recursion is not going to work, but getting rid of it is easier said than done.
We could try to persuade the LLM not to use the name of the function currently being defined, but this would mean that we couldn't use recursion at all. It also isn't sufficient: two mutually recursive functions could still expand into calls to each other in a trivial loop. We don't want to prohibit the LLM from using other names in the file, so instead we try to persuade it to generate code that implements the desired functionality rather than code that simply tail calls another function. This works better on the “thinking” models than it does on the “non-thinking” ones. The “non-thinking” models produce pretty crappy code that often contains trivial recursive loops.
I haven't found a reliably satisfactory solution to this, and to some extent that isn't surprising. When Comp Sci students are first introduced to recursion, they often don't understand the idea of the recursion bottoming out in some base case. The LLM isn't a Comp Sci student, so there are limits to what we can “teach” it.
Attempts at Interpretation
The LLM will often try to run the pseudocode rather than expand it. If it can find “tools” that appear to be relevant to the pseudocode, it may try to call them. This isn't what we want. We want the LLM to generate code that calls functions, not to call the functions itself. Currently I'm handling this by binding the current tools to the empty list so that the LLM doesn't have anything to call, but it would be nice if the LLM had a set of introspection tools that it could use to discover things about the Lisp environment. It might be interesting to give the LLM access to Quicklisp so it could download relevant libraries.
Training the LLM
The current implementation of pseudocode expansion uses a big set of system instructions to list the valid functions and variables and to provide persuasive instructions to the LLM to avoid the pitfalls above. This means that each interaction with the LLM involves only a few dozen tokens of pseudocode, but tens of thousands of tokens of system instructions. I am naively sending them each time. A better solution would be to upload the system instructions to the server and prime the LLM with them. Then we would only be sending the pseudocode tokens. This would cost much less in the long run.
Docstrings
Lisp has a rich set of documentation strings for functions and variables. We can send these to the LLM as part of the system instructions to help the LLM generate code. The problem is that this bloats the system instructions by a factor of 10. A full set of docstrings is over 100,000 tokens, and within a few interactions you'll use up your daily quota of free tokens and have to start spending money.
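Assembling that set of docstrings is simple; it's the size of the result that hurts. A sketch of the collection step (the helper is mine, not from the post):

```lisp
(defun function-docstrings (symbols)
  "Return an alist of (symbol . docstring) for the SYMBOLS that are
fbound and documented.  Serializing all of these into the system
instructions is what inflates the prompt so badly."
  (loop for sym in symbols
        when (and (fboundp sym) (documentation sym 'function))
          collect (cons sym (documentation sym 'function))))
```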
Tradeoffs
The more sophisticated LLM models produce better code, but they are considerably slower than the naive models. The naive models are quick, but they often produce useless code. I don't know if it is possible to customize an LLM to be at a sweet spot of performance and reliability.
The naive approach of sending the system instructions with each interaction is wasteful, but it is simple and works. It is good as an experimental platform and proof of concept, but a production system should have a driver application that first introspects the Lisp environment and then uploads all the information about the environment to the LLM server before entering the code generation phase.