Monday, December 23, 2019

Stringly typed

Here is a typical URL: http://www.host.com:8080/path/to/something?arg1=val1&arg2=val2 It has a lot of structure to it. There is the protocol, the hostname, the port, the path, and the query, which itself is made up of key/value pairs. Yet most programs don't treat URLs as data structures; they treat them as strings. (And you have about a 50/50 chance of that & being escaped correctly. Fortunately or not, many programs don't care whether things are correctly escaped.)
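
A small sketch is enough to give that URL some actual structure. It is only a sketch: it handles the scheme://host:port/path?key=val&key=val shape above and deliberately ignores escaping, userinfo, fragments, and error handling; the names here (URL, PARSE-URL, SPLIT-ON) are illustrative, not a library.

(defstruct url
  scheme host port path query)   ; QUERY is an alist of (key . value) pairs

(defun split-on (char string)
  "Return the substrings of STRING delimited by CHAR."
  (loop for start = 0 then (1+ end)
        for end = (position char string :start start)
        collect (subseq string start end)
        while end))

(defun parse-query (query-string)
  (mapcar (lambda (pair)
            (let ((eq (position #\= pair)))
              (cons (subseq pair 0 eq)
                    (and eq (subseq pair (1+ eq))))))
          (split-on #\& query-string)))

(defun parse-url (string)
  (let* ((scheme-end (search "://" string))
         (scheme     (subseq string 0 scheme-end))
         (rest       (subseq string (+ scheme-end 3)))
         (path-start (or (position #\/ rest) (length rest)))
         (authority  (subseq rest 0 path-start))
         (colon      (position #\: authority))
         (host       (subseq authority 0 colon))
         (port       (and colon (parse-integer authority :start (1+ colon))))
         (path+query (subseq rest path-start))
         (q          (position #\? path+query))
         (path       (subseq path+query 0 q))
         (query      (and q (parse-query (subseq path+query (1+ q))))))
    (make-url :scheme scheme :host host :port port :path path :query query)))

;; (parse-url "http://www.host.com:8080/path/to/something?arg1=val1&arg2=val2")
;; => #S(URL :SCHEME "http" :HOST "www.host.com" :PORT 8080
;;           :PATH "/path/to/something"
;;           :QUERY (("arg1" . "val1") ("arg2" . "val2")))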

Here is a UUID: 3e042a1d-0f68-4a78-8732-14e0731d7732 It has five obvious fields, but certain bits of the UUID are reserved for particular purposes. For instance, the most significant bits of groups 3 and 4 encode the UUID's version and variant, and the remaining bits might encode a timestamp or a MAC address. Yet most programs don't treat UUIDs as data structures. In fact, most don't even bother with a separate type; they just make strings out of them.
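
Again, a small sketch (not an RFC 4122 implementation) is enough to stop treating it as an opaque string. The field positions in the textual form are fixed, and the version sits in the top four bits of the third group:

(defstruct uuid
  time-low time-mid time-hi-and-version clock-seq node)

(defun parse-uuid (string)
  ;; The textual form has fixed field positions:
  ;; "3e042a1d-0f68-4a78-8732-14e0731d7732"
  ;;  ^0       ^9   ^14  ^19  ^24
  (flet ((hex (start end)
           (parse-integer string :start start :end end :radix 16)))
    (make-uuid :time-low            (hex 0 8)
               :time-mid            (hex 9 13)
               :time-hi-and-version (hex 14 18)
               :clock-seq           (hex 19 23)
               :node                (hex 24 36))))

(defun uuid-version (uuid)
  ;; The version number lives in the most significant four bits of the
  ;; third group; the variant bits sit at the top of the fourth.
  (ldb (byte 4 12) (uuid-time-hi-and-version uuid)))

;; (uuid-version (parse-uuid "3e042a1d-0f68-4a78-8732-14e0731d7732")) => 4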

Here are some email addresses, some of them unusual: very.common@example.com, "john..doe"@example.org, mailhost!username@example.org, user%example.com@example.org These, too, are structured. Each has a local part and a dotted, fully qualified domain name, joined by an @-sign delimiter. Yet most programs don't even bother treating them as a separate type; they just make strings out of them.
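
Even without trying to parse the local part's grammar (which is where the complexity lives), the top-level structure can be made explicit. A minimal sketch, splitting at the last @ and performing no validation at all:

(defstruct email-address
  local-part domain)

(defun parse-email-address (string)
  ;; A quoted local part may itself contain #\@; the domain cannot,
  ;; so this sketch simply splits at the last one.
  (let ((at (position #\@ string :from-end t)))
    (make-email-address :local-part (subseq string 0 at)
                        :domain     (subseq string (1+ at)))))

;; (parse-email-address "user%example.com@example.org")
;; => #S(EMAIL-ADDRESS :LOCAL-PART "user%example.com" :DOMAIN "example.org")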

JSON is a data encoding and transfer format that has nested structures and arrays, but ultimately builds its structure out of string-to-string mappings. I have often seen JSON used for serializing objects for interprocess calls, network interactions, or persistent storage. Far less often have I seen code that deserializes JSON objects into strongly typed, structured data. The data is usually left in stringified form, and the code uses a handful of ad hoc parsers to extract what it needs at the last moment, when it is wanted in a more structured format.
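
The fix is to do the conversion once, at the boundary. The sketch below assumes some JSON library has already decoded a hypothetical object with "name" and "id" fields into an alist of string keys (the exact decoded shape varies from library to library, so the alist is just a stand-in), and shows the step that usually gets skipped: turning the raw decoding into a typed object so the rest of the program never sees the strings.

(defstruct user
  name    ; a string
  id)     ; a UUID structure (see PARSE-UUID above), not the raw string

(defun user-from-json (decoded)
  "Convert the raw decoding into a typed USER at the boundary.
DECODED is an alist of string keys, standing in for whatever shape
your JSON library actually returns."
  (flet ((field (key) (cdr (assoc key decoded :test #'string=))))
    (make-user :name (field "name")
               :id   (parse-uuid (field "id")))))

;; (user-from-json '(("name" . "Ada")
;;                   ("id"   . "3e042a1d-0f68-4a78-8732-14e0731d7732")))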

All these are examples of “stringly typed” programming. In this style of programming, you get none of the advantages of strong typing and structured data, and all the fun that comes from accidentally using a string that represents one kind of object where another is called for.

The first problem in a stringly typed program is that the representation of abstract objects is exposed everywhere. Callers no longer have to treat objects abstractly and usually don't bother. Why call a special operator when a string operation will suffice?

Second, implementors usually don't provide a complete API. Why bother, when users are going to paste strings together anyway? Implementors often don't even document which string operations are valid and which lead to invalid objects.

Third, strings are not a very rich data type. You cannot embed other strings within them, for example, without escape sequences, and if everything is a string, the nested escaping quickly gets out of hand. How often have you seen a regexp with half a dozen ‘/’ characters? How often have you encountered a URL with too many or too few levels of “percent” escapes?

Fourth, it encourages implementors to try to be “helpful” with the few API calls they do provide. How often have you seen an API call that recursively unescapes something until there is nothing more to unescape? What if you need a representation of a URL rather than the URL itself? I recall a version of Common Lisp that “helpfully” stripped the “extra” quotation marks off the command line arguments. Unfortunately, this changed the meaning of the argument from a string that represented a pathname to the pathname itself, causing a type error. I eventually resorted to passing the character codes in as arguments and EVALing a (MAP 'STRING #'CODE-CHAR ...) on them to sneak them by the “helpful” command line processor.

Strings make a fine representation for many data types, but that is no excuse for skipping the work of building a fully abstract data type, even when the underlying representation is just a string.


Addendum to John Cowan:

I cannot for the life of me figure out why I cannot comment on my own posts. I've turned off all my browser extensions and even tried other browsers. Nothing. Anyway, here's what I have to say:

Now I know you're trolling me. You cannot be serious about keeping this data as an unstructured string and parsing it with line noise when you need to extract parts of it. The RFC even provides a BNF grammar. You're just going to ignore that?

I've read through your papers, but they hardly argue against structure — they argue against attempting to impose too much structure. They appear to impose a "top-level" structure and argue against trying to decompose things too far because there is too much variation. I agree, but I'm sure you'll agree with me that adding the Jesuit "S.J." to the country code is not a valid operation. I'm arguing that pulling out structure to the point where it makes sense (and not going into the micro-structure where it doesn't) is a better alternative than keeping everything as a string and trying to use regexps and string-manipulation operators to act on your data. If your friend becomes a Jesuit, you're not going to want to try to use some horrific regexp to split the string between his name and his address and then string-append an "S.J."
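
To make that concrete, here is a sketch of the compromise. The slots are illustrative, not a proposal for how far a name or an address ought to be decomposed; the point is only that once the suffixes are a field, adding one is a field update rather than string surgery:

(defstruct contact
  name        ; the full name, left as one string
  suffixes    ; a list of strings such as ("S.J." "Ph.D.")
  address)    ; the postal address, also left as an opaque string

(defun add-suffix (contact suffix)
  (pushnew suffix (contact-suffixes contact) :test #'string=)
  contact)

;; (add-suffix (make-contact :name "John Doe" :address "...") "S.J.")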

5 comments:

John Cowan said...

In my opinion, treating email addresses, postal addresses, or personal names as structured objects is generally a mistake, because of their overwhelming complexity. Blogger is giving me a problem here, so I'll post details in other comments.

John Cowan said...

Here is a regex that matches even the simplified email addresses of RFC 5322 without internationalization. As you can see, it has far too many parts to be put in a manageable structure, even if you just skip past the comments. (I've inserted newlines to try to mollify Blogger.)

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|
"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|
\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+
[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]
[0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:
(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\
[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

John Cowan said...

Finally, here are my detailed arguments against structured names and telephone numbers, and here's an XML schema (RNC format) for postal addresses, which I wrote when I absolutely needed one. Note, however, that it stores addresses but contains no information about how to format them, not even the order of elements, which varies depending on the sending country as well as the receiving country anyway.

jay gray said...

Would be very interested in your commentary/analysis of JSON-LD and specific @context vocabulary specifications. IMHO that approach implements strong (unambiguous) data-types and a format for 'string-handling' (parsing). An example of a processor: Google Structured Data Testing Tool. JSON-LD must be syntactically correct; content must conform to both a vocabulary and grammar. Please present your POV.

Joe Marshall said...

From a cursory look (which consisted of spending 10 minutes browsing the JSON-LD site), it appears they intend to address what I was mostly objecting to: the lack of metadata needed to interpret the strings in a JSON message. And it looks like they have largely succeeded. The next question is whether JSON-LD will be adopted by the community. For that, it will need support in commonly used tools such as Spring, Swagger, Postman, and MongoDB, among others, and then developers will have to actually use the metadata constructively. There's always the temptation to just label everything as a string in the metadata, which gets you nowhere. But it does look promising, and I look forward to a time when it, or something like it, gets adopted and used more widely and we're no longer passing naked strings around.