Monday, August 4, 2025

Challenges of Pseudocode Expansion

Expanding pseudocode into Common Lisp presents some interesting challenges. Here are a few of them:

Quoted code

The LLM tends to generate markdown. In particular, it likes to wrap code blocks in “code fences”: triple backticks (```) followed by the language name, then the code, then a closing set of triple backticks. This is a common way to format code in markdown.

Preferably, the generated code should be a standalone s-expression that we can simply read. I put explicit instructions in the system instructions not to use code fences, but the LLM can be pretty insistent about this. Eventually, I gave up and wrote some code to inspect the generated code and strip the fences.
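Stripping the fences doesn't require anything fancy. Here is a sketch in standard Common Lisp (the function name is mine, not from the actual implementation): it drops a leading ```` ```lang ```` line and a trailing ```` ``` ```` line when the response is fenced, and leaves unfenced text alone.

```lisp
(defun strip-code-fences (text)
  ;; If TEXT begins with a fence line like "```lisp", drop that line
  ;; and the final "```" line.  A sketch: assumes the fence, when
  ;; present, wraps the entire response.
  (let ((start 0)
        (end (length text)))
    (when (and (>= end 3) (string= text "```" :end1 3))
      (setf start (1+ (or (position #\Newline text) -1))))
    (let ((close (search "```" text :from-end t :start2 start)))
      (when (and close (> close start))
        (setf end close)))
    (string-trim '(#\Newline #\Space) (subseq text start end))))
```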

The LLM has a tendency to generate code that is quoted. For example, when told "add a and b", it will generate `(+ a b), which is a list of three symbols. This is usually not what we want (but it is a valid thing to want at times). It is only the top-level expression that is problematic — the LLM doesn't generate spurious internal quotations. I was unable to persuade the LLM to reliably not quote the top-level expression, so I wrote some code that inspects the generated code and removes an initial backtick. This is a bit of a hack because there are legitimate cases where a leading backtick is in fact correct, but more often than not it is an unwanted artifact. If the LLM truly “wanted” to generate a template, it could use explicit list building operations like list, cons, and append at top level.
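A similar post-processing pass handles the quoting. A minimal sketch (the name is mine, and it handles only plain quote; backquote reads as an implementation-dependent structure, so unwrapping it portably takes more care):

```lisp
(defun strip-toplevel-quote (form)
  ;; Unwrap a top-level (quote x).  Backquote is trickier: it reads
  ;; as an implementation-dependent form (e.g. a quasiquote marker
  ;; in SBCL), so a portable version needs per-implementation cases.
  (if (and (consp form)
           (eq (car form) 'quote)
           (consp (cdr form))
           (null (cddr form)))
      (cadr form)
      form))
```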

Known Functions

If you just ask the LLM to generate code, it will often make up function names that do not exist, or refer to libraries that are not loaded. To solve this problem, I provide a list of functions and I tell the LLM that the generated code can only use those functions. We could just provide the LLM with a list of all symbols that are fbound, but this would be a very long list and it would include functions that were never meant to be called from outside the package they are defined in. Instead, I provide two lists: one of the fbound symbols that are visible in the current package — either present directly in the package or external in packages used by the current package — and a second list of symbols that are fbound and external in any package. The symbols in the first list can be referred to without package qualifiers, while the symbols in the second list must have a package qualifier. The LLM is instructed to prefer symbols from the first list, but it is allowed to use symbols from the second list. That is, the LLM will prefer symbols that don't require a package qualifier, but it can borrow symbols that other packages export if it needs to. I provide analogous lists of bound symbols (global variables). This works pretty well — not perfectly, but well enough. It does require that the libraries be loaded prior to any pseudocode expansion so that we can find the library's external symbols.
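Both lists fall out of the standard package-iteration macros. A sketch (the function names are mine):

```lisp
(defun visible-fbound-symbols (&optional (package *package*))
  ;; Fbound symbols accessible in PACKAGE without a qualifier,
  ;; i.e. present in the package or inherited from used packages.
  (let ((result '()))
    (do-symbols (sym package)
      (when (fboundp sym)
        (pushnew sym result)))
    result))

(defun all-external-fbound-symbols ()
  ;; Fbound symbols exported by any package; these need a package
  ;; qualifier unless they also happen to be visible in the
  ;; current package.
  (let ((result '()))
    (dolist (pkg (list-all-packages))
      (do-external-symbols (sym pkg)
        (when (fboundp sym)
          (pushnew sym result))))
    result))
```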

But we have the problem that the code we are expanding is not yet loaded. The LLM won't know about the functions defined in the current file until we load it. This is a serious problem. The solution is to provide the LLM with the source code of the file being compiled. This is a bit of a hack, but it exposes the names being defined to the LLM so it can generate code that refers to other definitions in the same file.

Naive Recursion

Suppose the LLM is told to define a function FOO that subtracts two numbers. It looks in the source code and discovers a definition for a function called FOO, i.e. the very fragment of source code we are expanding. It sees that this function, when defined, is supposed to subtract two numbers. This is precisely what we want to do, so the LLM reasons that we can simply call this function. Thus the body of FOO becomes a call to the function FOO. Obviously this kind of naive recursion is not going to work, but getting rid of it is easier said than done.

We could try to persuade the LLM to not use the name of the function currently being defined, but this would mean that we couldn't use recursion at all. It also isn't sufficient: two mutually recursive functions could still expand into calls to each other in a trivial loop. We don't want to prohibit the LLM from using other names in the file, so we try to persuade it to generate code that implements the desired functionality rather than code that simply tail calls another function. This works better on the “thinking” models than it does on the “non-thinking” ones. The “non-thinking” models produce pretty crappy code that often contains trivial recursive loops.
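One cheap mechanical backstop is to reject the degenerate expansion before accepting it. A sketch (the predicate is hypothetical, not from the actual implementation) that flags a body consisting of nothing but a call to the function being defined on its own unmodified parameters — it would not catch mutual-recursion loops, which need a call-graph check:

```lisp
(defun naively-recursive-p (name lambda-list body)
  ;; True when BODY is a single form that just calls NAME on the
  ;; unmodified parameters, e.g. (defun foo (a b) (foo a b)).
  (and (= (length body) 1)
       (let ((form (first body)))
         (and (consp form)
              (eq (car form) name)
              (equal (cdr form) lambda-list)))))
```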

I haven't found a reliably satisfactory solution to this, and to some extent that isn't surprising. When Comp Sci students are first introduced to recursion, they often don't understand the idea of the recursion bottoming out in some base case. The LLM isn't a Comp Sci student, so there are limits as to what we can “teach” it.

Attempts at Interpretation

The LLM will often try to run the pseudocode rather than expand it. If it can find “tools” that appear to be relevant to the pseudocode, it may try to call them. This isn't what we want: we want the LLM to generate code that calls functions, not to call the functions itself. Currently I'm handling this by binding the current tools to the empty list so that the LLM doesn't have anything to call, but it would be nice if the LLM had a set of introspection tools that it could use to discover things about the Lisp environment. It might be interesting to give the LLM access to Quicklisp so it could download relevant libraries.

Training the LLM

The current implementation of pseudocode expansion uses a big set of system instructions to list the valid functions and variables and to provide persuasive instructions to the LLM to avoid the pitfalls above. This means that each interaction with the LLM involves only a few dozen tokens of pseudocode, but tens of thousands of tokens of system instructions. I am naively sending them each time. A better solution would be to upload the system instructions to the server and prime the LLM with them. Then we would only be sending the pseudocode tokens. This would cost much less in the long run.

Docstrings

Lisp has a rich set of documentation strings for functions and variables. We can send these to the LLM as part of the system instructions to help the LLM generate code. The problem is that this bloats the system instructions by a factor of 10. A full set of docstrings is over 100,000 tokens, and within a few interactions you'll use up your daily quota of free tokens and have to start spending money.
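The docstrings themselves are trivial to harvest; the cost is purely in tokens. A sketch (the name is mine) that pairs each symbol with its function docstring, skipping undocumented ones:

```lisp
(defun collect-function-docstrings (symbols)
  ;; Pair each symbol with its function docstring, if any.  Variable
  ;; and type documentation could be gathered the same way by
  ;; passing 'variable or 'type as the doc-type.
  (loop for sym in symbols
        for doc = (and (fboundp sym) (documentation sym 'function))
        when doc
          collect (cons sym doc)))
```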

Tradeoffs

The more sophisticated LLM models produce better code, but they are considerably slower than the naive models. The naive models are quick, but they often produce useless code. I don't know if it is possible to customize an LLM to be at a sweet spot of performance and reliability.

The naive approach of sending the system instructions with each interaction is wasteful, but it is simple and works. It is good as an experimental platform and proof of concept, but a production system should have a driver application that first introspects the lisp environment and then uploads all the information about the environment to the LLM server before entering the code generation phase.
