http://www.host.com:8080/path/to/something?arg1=val1&arg2=val2It has a lot of structure to it. There is the protocol, the hostname, the port, the path, and the query, which itself is made up of key/value pairs. Yet most programs don't treat URLs as data structures, they treat them as strings. (And you have about a 50/50 chance of that & being escaped correctly. Fortunately or not, many programs don't care if things are correctly escaped.)
Here is a UUID:
3e042a1d-0f68-4a78-8732-14e0731d7732It has five obvious fields, but certain bytes of the UUID are reserved for different purposes. For instance, the most significant bits of group 3 and 4 are used to encode the type of UUID and the other bits might encode a timestamp or MAC address. Yet most programs don't treat UUIDs as data structures. In fact, most don't even bother with a separate type, they just make strings out of them.
Here are some email addresses, some are unusual:
email@example.comThese, too, are structured. All contain a fully qualified domain name separated by dots and an @-sign delimiter. Yet most programs don't bother even treating them as separate types and just make strings out of them.
JSON is a data encoding and transfer format which has nested structures and arrays, but ultimately builds structure out of string to string mappings. Often I have seen JSON format used for serializing objects for interprocess calls, network interactions, or storing objects persistently. Far less often have I seen code that deserializes JSON objects into strongly typed, structured data. It is often left in stringified form and the code uses a handful of ad hoc parsers for extracting the necessary data at the last moment, when it is wanted in a more structured format.
All these are examples of “stringly typed” programming. In this style of programming, you get none of the advantages of strong typing and structured data, and all the fun that comes from accidentally using a string that represents one kind of object where another is called for.
The first problem in a stringly typed program is that the representation of abstract objects is exposed everywhere. Callers no longer have to treat objects abstractly and usually don't bother. Why call a special operator when a string operation will suffice? Second, implementors usually don't provide a complete API — why bother when users are going to paste strings together anyway? Implementors often don't even document which string operations are valid and which lead to invalid objects. Third, strings are not a very rich data type. You cannot embed other strings within them, for example, without escape sequences, and if everything is a string, the nested escaping quickly gets out of hand. How often have you seen a regexp with half a dozen ‘/’ characters? How often have you encountered a URL with too many or too few levels of “percent” escapes? Fourth, it encourages implementors to try to be “helpful” with the few API calls they do provide. How often have you seen an API call that recursively unescapes something until there is nothing more to unescape? What if you need a representation of a URL rather than the URL itself? I recall a version of Common Lisp that “helpfully” stripped the “extra” quotation marks off the command line arguments. Unfortunately, this changed the meaning of the argument from a string that represented a pathname to the pathname itself, causing a type error. I eventually resorted to passing the character codes in as arguments and
(MAP 'STRING CHAR-CODE ...)on them to sneak them by the “helpful” command line processor.
Strings make a fine representation for many data types, but you shouldn't avoid the work needed to make a fully abstract data type even if the underlying type is just strings.
Addendum to John Cowan:
I cannot for the life of me figure out why I cannot comment on my own posts. I've turned off all my browser extensions and even tried other browsers. Nothing. Anyway, here's what I have to say:
Now I know you're trolling me. You cannot be serious about keeping this data as an unstructured string and parsing it with line noise when you need to extract parts of it. The RFC even provides a BNF grammar. You're just going to ignore that?
I've read through your papers, but they hardly argue against structure — they argue against attempting to impose too much structure. They appear to impose a "top-level" structure and argue against trying to decompose things too far because there is too much variation. I agree, but I'm sure you'll agree with me that adding the Jesuit "S.J." to the country code is not a valid operation. I'm arguing that pulling out structure to the point where it makes sense (and not going into the micro-structure where it doesn't) is a better alternative than keeping everything as a string and trying to use regexps and string-manipulation operators to act on your data. If your friend becomes a Jesuit, you're not going to want to try to use some horrific regexp to split the string between his name and his address and then string-append an "S.J."