I recently tried out GitHub Copilot. It is a system that uses generative AI to help you write code.
The tool plugs into your IDE (I used VS Code) and acts as an autocomplete on steroids … or acid. Suggested comments and code appear as you move the cursor, and you can often choose from a couple of different completions. The way to get it to write code was simply to describe what you wanted in a comment. (There is a chat interface where you can give it more directions, but I did not play with that.)
I decided to give it my standard interview question: write a simple TicTacToe class, include a method to detect a winner. The tool spit out a method that checked an array for three in a row horizontally, vertically, and along the two diagonals. Almost correct. While it would detect three ‘X’s or ‘O’s, it also would detect three nulls in a row and declare null the winner.
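The bug is easy to reproduce. What follows is a minimal Python sketch, not Copilot's actual output: the function names, the flat nine-square list, and the use of None for an empty square are all assumptions made for illustration.

```python
# The eight winning lines on a 3x3 board stored as a flat list of 9 squares.
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def find_line_buggy(board):
    """Return (line, mark) for the first line of three equal squares.

    Hypothetical reconstruction of the flaw: three empty squares are
    also "equal", so an empty line is reported with None as its mark.
    """
    for line in LINES:
        a, b, c = line
        if board[a] == board[b] == board[c]:
            return line, board[a]
    return None


def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        # Require the first square to be occupied before comparing.
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None
```

On an empty board, find_line_buggy reports the top row as a completed line whose mark is None, while winner correctly returns None. The whole fix is the one extra "is not None" test.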
I went into the class definition and simply typed a comment character. It suggested an __init__ method. It decided on a board representation of a 1-dimensional array of 9 characters, ‘X’ or ‘O’ (or null), and a character that determined whose turn it was. Simply by moving the cursor down I was able to get it to suggest methods to return the board array, return the current turn, list the valid moves, and make a move. The suggested code was straightforward and didn’t have bugs.
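For readers who want to picture the result, here is a rough Python sketch of a class with that shape. It is a reconstruction from the description above, not the code Copilot generated; the method names, the choice of None for an empty square, and the error handling in make_move are my assumptions.

```python
class TicTacToe:
    """Board is a flat list of 9 squares: 'X', 'O', or None (empty)."""

    def __init__(self):
        self.board = [None] * 9
        self.turn = "X"  # assumption: X moves first

    def get_board(self):
        # Return a copy so callers cannot mutate the game state directly.
        return list(self.board)

    def get_turn(self):
        return self.turn

    def valid_moves(self):
        # Indices of all empty squares.
        return [i for i, square in enumerate(self.board) if square is None]

    def make_move(self, index):
        # Rejects occupied squares and out-of-range indices alike.
        if index not in self.valid_moves():
            raise ValueError(f"square {index!r} is not a valid move")
        self.board[index] = self.turn
        self.turn = "O" if self.turn == "X" else "X"
```

A short session: after game = TicTacToe() and game.make_move(4), the center square holds "X", game.get_turn() is "O", and a second game.make_move(4) raises ValueError.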
I then decided to try it out on something more realistic. I have a linear fractional transform library I wrote in Common Lisp and I tried porting it to Python. Copilot made numerous suggestions as I was porting, with varying degrees of success. It was able to complete the equations for a 2x2 matrix multiply, but it got hopelessly confused on higher-order matrices. For the print method of a linear fractional transform, it produced many lines of plausible-looking code. Unfortunately, the code has to be better than “plausible-looking” in order to run.
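For context on where the 2x2 multiply comes from: a linear fractional transform maps x to (ax + b)/(cx + d), and composing two of them is the same arithmetic as multiplying their 2x2 coefficient matrices. The sketch below illustrates that standard fact in Python. It is not code from my library or from Copilot; the LFT class and the compose function are hypothetical names chosen for this example.

```python
from fractions import Fraction
from typing import NamedTuple


class LFT(NamedTuple):
    """The transform x -> (a*x + b) / (c*x + d), with integer coefficients."""
    a: int
    b: int
    c: int
    d: int

    def __call__(self, x):
        # x should be an int or a Fraction so the result stays exact.
        return Fraction(self.a * x + self.b, self.c * x + self.d)


def compose(f, g):
    """Return the transform x -> f(g(x)).

    The coefficients are the entries of the matrix product
    [[f.a, f.b], [f.c, f.d]] @ [[g.a, g.b], [g.c, g.d]].
    """
    return LFT(
        f.a * g.a + f.b * g.c, f.a * g.b + f.b * g.d,
        f.c * g.a + f.d * g.c, f.c * g.b + f.d * g.d,
    )
```

For example, with f = LFT(1, 2, 3, 4) and the reciprocal g = LFT(0, 1, 1, 0), compose(f, g) is LFT(2, 1, 4, 3), and both f(g(5)) and compose(f, g)(5) evaluate to Fraction(11, 23).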
As a completion tool, Copilot muddled its way along. Occasionally, it would get a completion impressively right, but just as frequently, or more often, it would get the completion wrong, either grossly or subtly. It is the latter that made me nervous. Copilot would produce code that looked plausible, but it required a careful reading to determine if it was correct. It would be all too easy to be careless and accept buggy code.
The code Copilot produced was serviceable and pedestrian, but often not what I would have written. I consider myself a “mostly functional” programmer. I use mutation sparingly, and prefer to code by specifying mappings and transformations rather than sequential steps. Copilot, drawing from a large amount of code written by a variety of authors, seems to prefer to program sequentially and imperatively. This isn’t surprising, but it isn’t helpful, either.
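To make the contrast concrete, here is a small illustrative pair in Python. Neither function is Copilot output; both are made-up examples (with made-up names) that compute the sum of the squares of the even numbers in a sequence.

```python
def sum_even_squares_imperative(numbers):
    # Sequential style: walk the input and mutate an accumulator.
    total = 0
    for n in numbers:
        if n % 2 == 0:
            total += n * n
    return total


def sum_even_squares_functional(numbers):
    # Transformation style: filter, map, and reduce in one expression.
    return sum(n * n for n in numbers if n % 2 == 0)
```

Both return the same result for any input. The first is the kind of code I kept being offered; the second is closer to what I would write by hand.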
Copilot is not going to put any programmers out of work. It simply isn’t anywhere near good enough. It doesn’t understand what you are attempting to accomplish with your program; it just pattern matches against other code. A fair amount of code is full of patterns, and the pattern matching does a fair job. But exceptions are the norm, and Copilot won’t handle edge cases unless the edge case is extremely common.
I found myself accepting Copilot’s suggestions on occasion. Often I’d accept an obviously wrong suggestion because it was close enough that editing it seemed like less work. But I always had to guard against code that seemed plausible but was not correct. I found that I spent a lot of time reading and considering the code suggestions. Any time saved by generating these suggestions was used up in vetting them.
One danger of Copilot is using it as a coding standard. It produces “lowest common denominator” code: code that an undergraduate who hadn’t completed the course might produce. For those of us who think the current standard of coding is woefully inadequate, Copilot just reinforces this style of coding.
Copilot is kind of fun to use, but I don’t think it helps me be more productive. It is a bit quicker than looking things up on Stack Overflow, but its results have less context. You wouldn’t go to Stack Overflow and just copy code blindly. Copilot isn’t quite that (it will at least rename the variables), but it produces code that is more likely buggy than not.
Joe,
I have no experience with Copilot, but I've been using ChatGPT to generate and transform code in recent weeks. I mostly use it when I need to write something from scratch. I provide the basic set of requirements in prose, and maybe some additional technical context like, say, a set of PostgreSQL DDL statements that define tables. ChatGPT is very good at providing the scaffolding and some naive implementation. This of course works best with languages that are popular today, but I can firmly say that it is a huge time saver and that it allows me to tackle tasks that would previously have cost me too much time just in ceremony.
-Hans
My experience experimenting with LLMs is that they are principally tuned to concept mapping. So, to take your tic-tac-toe example, the LLM takes your description, maps the concept of the game, then takes your request for a specific representation, maps that concept, and subsequently returns the product of those two. Insofar as concepts can be represented in code, this is a mostly reasonable approach. But as any programmer worth their salt knows, code that accurately "represents a concept" is very different from code that actually "works".
Instead of having an LLM write code for you, I wonder if it might be more successful at writing pseudo-code, UML diagrams, or acceptance criteria, leaving the actual code writing for the code writers.