Monday, July 7, 2025

LLM failures

I recently tried to get an LLM to solve two problems that I thought were well within its capabilities. I was surprised to find that it was unable to solve either of them.

Problem 1: Given a library of functions, re-order the function definitions in alphabetical order.

It can be useful to have the functions in alphabetical order so you can find them easily (assuming you are not constrained by any other ordering). The LLM was given a file of function definitions and was asked to put them in alphabetical order. It refused to do so, claiming that it was unable to determine the function boundaries because it had no model of the language syntax.

Fair enough, but a model of the language syntax isn't strictly necessary. Function definitions in most programming languages are easy to identify textually: they are separated by blank lines, they start at the left margin, the function body is indented, and the delimiters are balanced. Moreover, this was source code the LLM itself had written, so it is clearly able to generate syntactically correct functions. It surprised me that it gave up so easily on the much simpler task of re-arranging blocks of text it had already generated.
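For what it's worth, that heuristic is simple enough to sketch in a few dozen lines. Here is a rough illustration in Go (an assumption on my part; the pattern used to pull out a function name is Go-specific and would need adjusting for the library's actual language), treating every non-indented line that follows a blank line as the start of a new top-level block and sorting the blocks by name:

```go
// Sketch of the textual heuristic: group lines into top-level blocks, then
// sort the blocks by the function name in their first line. Not a robust
// tool; the "func Name" pattern below is an assumed convention.
package main

import (
	"fmt"
	"os"
	"regexp"
	"sort"
	"strings"
)

var funcName = regexp.MustCompile(`^func\s+(\w+)`) // assumed naming pattern

func main() {
	src, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	lines := strings.Split(string(src), "\n")

	// A new block begins at a line that starts at the left margin and is
	// preceded by a blank line (or at the start of the file).
	var blocks []string
	var cur []string
	for i, line := range lines {
		startsBlock := i == 0 ||
			(strings.TrimSpace(line) != "" &&
				!strings.HasPrefix(line, " ") &&
				!strings.HasPrefix(line, "\t") &&
				strings.TrimSpace(lines[i-1]) == "")
		if startsBlock && len(cur) > 0 {
			blocks = append(blocks, strings.Join(cur, "\n"))
			cur = nil
		}
		cur = append(cur, line)
	}
	if len(cur) > 0 {
		blocks = append(blocks, strings.Join(cur, "\n"))
	}

	// Sort by function name where one is found; other blocks sort by their
	// first line so they keep a predictable position.
	sort.SliceStable(blocks, func(i, j int) bool {
		return key(blocks[i]) < key(blocks[j])
	})
	fmt.Println(strings.TrimSpace(strings.Join(blocks, "\n")))
}

func key(block string) string {
	first := strings.SplitN(strings.TrimSpace(block), "\n", 2)[0]
	if m := funcName.FindStringSubmatch(first); m != nil {
		return m[1]
	}
	return first
}
```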

Problem 2: Given the extended BNF grammar of the Go programming language, a) generate a series of small Go programs that illustrate the various grammar productions, b) generate a parser for the grammar.

This is the kind of problem I would have thought was right up the LLM's alley. Although it easily generated twenty Go programs of varying complexity, it utterly failed to generate a parser that could handle them.

Converting a grammar to a parser is a common task, and there must be plenty of examples of Go parsers available online. Furthermore, taking a grammar as input and producing a parser is a well-solved problem. The LLM made a quick attempt at generating a simple recursive descent parser (probably the easiest way to turn a grammar into a parser short of using a dedicated parser generator), but the resulting parser could only handle the most trivial programs. When I asked the LLM to extend the parser to handle the more complex examples it had generated, it started hallucinating and getting lost in the weeds. I spent several hours redirecting the LLM and refining my prompts to help it along, but it never made much progress beyond parsing the simplest programs.
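For readers unfamiliar with the term, recursive descent means writing one parsing function per grammar production, with repetition in the production becoming a loop and alternation becoming a branch. The following is a minimal sketch of that mapping on a toy arithmetic grammar; it is only an illustration of the technique (the "lexer" is just strings.Fields on space-separated input), not the Go parser I was asking the LLM for:

```go
// Recursive descent sketch for the toy grammar:
//
//	Expr   = Term { ("+" | "-") Term } .
//	Term   = Factor { ("*" | "/") Factor } .
//	Factor = number | "(" Expr ")" .
package main

import (
	"fmt"
	"strconv"
	"strings"
)

type parser struct {
	toks []string // space-separated tokens in this toy example
	pos  int
}

func (p *parser) peek() string {
	if p.pos < len(p.toks) {
		return p.toks[p.pos]
	}
	return ""
}

func (p *parser) next() string { t := p.peek(); p.pos++; return t }

// Expr = Term { ("+" | "-") Term } .
func (p *parser) expr() float64 {
	v := p.term()
	for p.peek() == "+" || p.peek() == "-" {
		if p.next() == "+" {
			v += p.term()
		} else {
			v -= p.term()
		}
	}
	return v
}

// Term = Factor { ("*" | "/") Factor } .
func (p *parser) term() float64 {
	v := p.factor()
	for p.peek() == "*" || p.peek() == "/" {
		if p.next() == "*" {
			v *= p.factor()
		} else {
			v /= p.factor()
		}
	}
	return v
}

// Factor = number | "(" Expr ")" .
func (p *parser) factor() float64 {
	if p.peek() == "(" {
		p.next() // consume "("
		v := p.expr()
		p.next() // consume ")"
		return v
	}
	n, _ := strconv.ParseFloat(p.next(), 64)
	return n
}

func main() {
	p := &parser{toks: strings.Fields("2 * ( 3 + 4 ) - 5")}
	fmt.Println(p.expr()) // prints 9
}
```

This mechanical correspondence between productions and functions is exactly why I expected the LLM to scale it up to the Go grammar without much trouble.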

(I tried both Cursor with Claude Sonnet 4 and Gemini CLI with Gemini 2.5.)

Both of these problems uncovered surprising limitations in using LLMs to generate code, and both seem to me to be squarely in the realm of tasks one would hand to an LLM.