Abstract Heresies: December 2019

Monday, December 23, 2019

Stringly typed

Here is a typical url: http://www.host.com:8080/path/to/something?arg1=val1&arg2=val2 It has a lot of structure to it. There is the protocol, the hostname, the port, the path, and the query, which itself is made up of key/value pairs. Yet most programs don't treat URLs as data structures, they treat them as strings. (And you have about a 50/50 chance of that & being escaped correctly. Fortunately or not, many programs don't care if things are correctly escaped.)

Here is a UUID: 3e042a1d-0f68-4a78-8732-14e0731d7732 It has five obvious fields, but certain bytes of the UUID are reserved for different purposes. For instance, the most significant bits of group 3 and 4 are used to encode the type of UUID and the other bits might encode a timestamp or MAC address. Yet most programs don't treat UUIDs as data structures. In fact, most don't even bother with a separate type, they just make strings out of them.

Here are some email addresses, some are unusual: very.common@example.com, "john..doe"@example.org, mailhost!username@example.org, user%example.com@example.org These, too, are structured. All contain a fully qualified domain name separated by dots and an @-sign delimiter. Yet most programs don't bother even treating them as separate types and just make strings out of them.

JSON is a data encoding and transfer format which has nested structures and arrays, but ultimately builds structure out of string to string mappings. Often I have seen JSON format used for serializing objects for interprocess calls, network interactions, or storing objects persistently. Far less often have I seen code that deserializes JSON objects into strongly typed, structured data. It is often left in stringified form and the code uses a handful of ad hoc parsers for extracting the necessary data at the last moment, when it is wanted in a more structured format.

All these are examples of “stringly typed” programming. In this style of programming, you get none of the advantages of strong typing and structured data, and all the fun that comes from accidentally using a string that represents one kind of object where another is called for.

The first problem in a stringly typed program is that the representation of abstract objects is exposed everywhere. Callers no longer have to treat objects abstractly and usually don't bother. Why call a special operator when a string operation will suffice? Second, implementors usually don't provide a complete API — why bother when users are going to paste strings together anyway? Implementors often don't even document which string operations are valid and which lead to invalid objects. Third, strings are not a very rich data type. You cannot embed other strings within them, for example, without escape sequences, and if everything is a string, the nested escaping quickly gets out of hand. How often have you seen a regexp with half a dozen ‘/’ characters? How often have you encountered a URL with too many or too few levels of “percent” escapes? Fourth, it encourages implementors to try to be “helpful” with the few API calls they do provide. How often have you seen an API call that recursively unescapes something until there is nothing more to unescape? What if you need a representation of a URL rather than the URL itself? I recall a version of Common Lisp that “helpfully” stripped the “extra” quotation marks off the command line arguments. Unfortunately, this changed the meaning of the argument from a string that represented a pathname to the pathname itself, causing a type error. I eventually resorted to passing the character codes in as arguments and EVALing a (MAP 'STRING CHAR-CODE ...) on them to sneak them by the “helpful” command line processor.

Strings make a fine representation for many data types, but you shouldn't avoid the work needed to make a fully abstract data type even if the underlying type is just strings.

Addendum to John Cowan:

I cannot for the life of me figure out why I cannot comment on my own posts. I've turned off all my browser extensions and even tried other browsers. Nothing. Anyway, here's what I have to say:

Now I know you're trolling me. You cannot be serious about keeping this data as an unstructured string and parsing it with line noise when you need to extract parts of it. The RFC even provides a BNF grammar. You're just going to ignore that?

I've read through your papers, but they hardly argue against structure — they argue against attempting to impose too much structure. They appear to impose a "top-level" structure and argue against trying to decompose things too far because there is too much variation. I agree, but I'm sure you'll agree with me that adding the Jesuit "S.J." to the country code is not a valid operation. I'm arguing that pulling out structure to the point where it makes sense (and not going into the micro-structure where it doesn't) is a better alternative than keeping everything as a string and trying to use regexps and string-manipulation operators to act on your data. If your friend becomes a Jesuit, you're not going to want to try to use some horrific regexp to split the string between his name and his address and then string-append an "S.J."

Monday, December 16, 2019

100% Coverage

I'm doing some consulting for a company that has a policy of 100% branch coverage on all their tests. I have mixed feelings about this. Certainly for some of the software they develop, avionics e.g., software bugs can get people killed and I can see that 100% branch coverage helps assure that this is unlikely (but it doesn't guarantee it, unfortunately. Just because you've tested all the branches doesn't mean the software is correct. But it goes a long way to making sure your software behaves predictably.)

Fortunately, I don’t work on software that people's lives depend upon. I'm working on supply-chain and inventory software and here the issue is murkier. Sure, you can argue that “For want of a nail...”, but realistically, a bug in the supply-chain software is simply an annoyance that means wasted space if you have too many of something or a delay in production if you have too few. And there are already many other sources of production delay. Or perhaps you'll see an ugly display if the bug is in the UI. These are annoying, but hardly life threatening.

There is a significant downside to 100% branch coverage. It means that you have to persuade the code to take branches it wouldn't normally take. Sometimes this is as easy as passing out of range values to a procedure, but sometimes it is hard. Code isn’t designed to fail, it is designed to work, so sometimes you have to induce failure to take the non-normal branch. It shouldn’t be easy to induce failure. It should be close to impossible. That makes testing the failure cases close to impossible as well. You have two equally unpleasant alternatives: either adjust the code to make it easier to induce failure, or remove the branch that handles the “impossible” case.

I am a proponent of “defensive” programming. I put “sanity checks” in my code to ensure that the system is in an expected state. It may not be possible for it to get into an unexpected state, but I put the checks in anyway. I add default clauses even if all the cases are covered. My conditional branches often have a clause for the “can’t happen” case. I add sentinel values to enums that they should never be able to take on. These situations either fail-fast or drop you into the debugger depending on whether you have a debugger or not. This kind of defensive programming introduces branches that should not be possible to take, so it is not possible to get 100% branch coverage. If you want 100% branch coverage, you are left with the unpleasant options of hacking your code so it can fail sanity checks, enter states not covered by all cases, do things that “can’t happen” or take on illegal values, or, alternatively, removing the sanity checks so all branches can be taken.

Code can change, and these defensive techniques help ensure that the code detects breaking changes and acts appropriately by dropping into the debugger or failing-fast.

Of course, if you are flying an airplane, you don't want your software to unexpectedly drop into the debugger and the airplane to unexpectedly drop out of the sky. In this case, a policy of 100% branch coverage makes sense. For business software, it is harder to make that case.

Addendum: Stupid tools.

Java provides a nice set of stupid tools. One prime example is the branch coverage tool. It tells you which conditionals in the code are fully tested or only partially tested. What it doesn't tell you is which branch was tested. So you have to guess. Sometimes it is fairly obvious, but other times, especially with slightly more complex conditionals, it is very difficult to figure out which branches are being taken and which are not. Now there are some reasons why the coverage tool cannot determine this basic information, and I'm sure they are logical (I understand it has something to do with it being difficult to map the optimized byte code back to the source code), but seriously, a tool that can detect coverage but cannot tell you what exactly is covered is only half baked. This encourages people to rewrite their conditionals to avoid anything more complex than single tests, which makes the code more verbose and less readable. Thanks a bunch.

Thursday, December 12, 2019

I think you left something out

I recently had the pleasure of working with some auto-generated Java code. All the accessors and mutators were auto-generated, so was the predicate. The only thing missing was the constructor. You could manipulate one of these objects in any way you desired, but there was no easy way to instantiate one in the first place. I eventually figured out a way to instantiate one through reflection. I created a JSON version of the object and deserialized it. Crude, but effective.

I ran into a similar problem where I needed an instance of an object, but some idiot decided to make all the constructors protected. In this case, I created an unwanted subclass with a public constructor, called that instead, and then upcast to the object I really wanted.

Of course I was using these objects in a way the author of the code did not intend they be used, but the author wasn't omniscient and didn't foresee my use case (which was testing). By trying to prevent unintended uses, he simply forced me to write nasty code to get around his ultimately ineffective roadblocks. The author could have spent his time more productively creating a library with a more flexible API to handle unforeseen uses rather than inventing useless impediments to getting the job done.

I still cannot seem to reply to comments. I type a nice reply, hit the “publish” button, and the reply disappears, never to be seen again.

I'm working at a company that requires 100% branch coverage in tests. This requires teasing white-box libraries into returning malformed replies to cover all the branches in the error handler — even the ones that cannot actually be taken because the library doesn't actually generate malformed replies. So I have to supply the malformed replies myself, and it isn't easy if there aren't data constructors. We don't have AspectJ, and the source code isn't always available. And even when it is, it's simply not a good idea to modify it so that it is capable of generating errors it wouldn't otherwise be able to. This leaves me to resort to the hacks.

I was wrong. I just stumbled across the first example. It turns out that the constructor was generated, it just had no arguments. Now this is not unusual in Java (although I consider it bad form. You should be able to construct and initialize object in one step rather than have to construct an skeleton object and smash in the field values. This reduces your chances of creating partially initialized objects.) What made this a problem is that the generated code had no mutators, so you could only create skeleton objects, you couldn't put any values in them. I did remember correctly that I deserialized a JSON version to create an object with actual values in it. This seemed a better option than directly using reflection because the JSON object “looked” like the resulting object I wanted.

I'm no fan of just making every field mutable by default — just the opposite, in fact. Objects should usually be immutable. But you ought to have a way to create an initialized object without resorting to reflection.

Tuesday, December 10, 2019

(map append elements)

I can recall the difficulty I had understanding let expressions in Lisp. I used to write out the equivalent lambda expression and then manually add the syntactic sugar to make it into a let. And I don't recall having an easy time with the chapter on linked lists. So I find it amusing to see what tortuous code students come up with when they are first introduced to linked lists. My favorite so far has been (map append elements), which looks like it might join the sublists in elements, but if you think about it, the arity is all wrong: map doesn't apply its function, it calls it. This had me scratching my head for several minutes until I realized you can call apply on a single argument.

So what little gems of code have you discovered that perform the most trivial of tasks in the most convoluted or obscure manner?

Friday, December 6, 2019

“No ‘x’ in ‘Nixon’”

I just had a phone screen where they asked if I could write a palindrome detector. I wrote a recursive and an iterative version. I like the recursive solution because it is so simple. There are only three equations:

An empty string is a palindrome.
A string with one character is a palindrome.
If a string starts and ends with the same character, and the substring between them is a palindrome, then the entire string is a palindrome.

I wrote them the iterative version so I wouldn't look like an idiot. The iterative version is generally faster with a standard compiler, but it is more complex. Not much more complex, but still, you have to keep track of two indexes into the string and how close they've gotten to each other.

The recursive solution is just more elegant.

Apparently I cannot post comments on my own blog. Well, I can edit the post. "if s[pos] != s[-pos-1]" sure looks like two pointers, but to be fair, the recursive solution hides its pointers in the string structure or in the slices if it is using them. As for "(string= x (string-reverse x))", there is something elegant in something that crude.