Abstract Heresies: April 2010

Friday, April 30, 2010

Importance of terminology

An upcoming study of first-year computer science students has an interesting observation about terminology. It seems that their understanding of the computer science vocabulary is much worse than one would expect. You cannot blame the teachers because I'm sure they were unaware of the problem. You cannot blame the students for being `dumb' because nearly all of them could identify a `function body'. Yet fewer than one in three could tell you what is meant by a `defined name' or a `predicate'. Really! The results surprised me, but in thinking about it, I have a couple of personal anecdotes that highlight the problem.

Every year or so there used to be a heated debate on one or more usenet groups (remember usenet?) between the static and dynamic type factions. (I won't argue either side.) One year, however, I encountered someone who challenged me about my viewpoint. He said that he knew many people who held a similar viewpoint to mine, and he had a lot of respect for them. They weren't idiots. But he could not understand their rationale. His viewpoint made perfect logical sense to him, so how could another intelligent person disagree so violently? He didn't want to argue, he wanted to understand how anyone could find the alternative viewpoint to be as logical. (He assumed it must be so, or there wouldn't be so many respectable adherents to it.)

We discussed it back and forth for a while in a non-heated way and it dawned on me that the fundamental problem is simply one of terminology. `Static typers' and `dynamic typers' have completely different understandings of the word `type'. They cannot agree on the simplest things because they each think the other knows what he means by `type'.

For the sake of clarity, I'll lay out the definitions:

Type₁ — an equivalence class for data. Two objects are of the ‘same type’ if operations on one could be (or could have been) performed on the other with analagous results. For example, ‘integer’ is a type because objects like ‘3’ and ‘42’ can both be added, subtracted, multiplied, and divided.
Type₂ — a purely syntactic label that can be assigned to a code fragment. There are rules for assigning such labels to primitive fragments and rules for composing types as the code fragments are composed. Certain compositions are disallowed. A well-typed program has no disallowed compositions, and a type-checking program can verify this. By clever construction of the rules, certain useful runtime properties can be proven.

I think you can see that these two definitions have almost nothing to do with each other.

Once I understood this, I stopped using the word ‘type’ and started talking about ‘tags’ — manifest metadata attached to all objects in memory that describe what each object is. ‘Static typers’ understand the utility and benefit such things, they don't understand why anyone would want to forego compile-time ‘type checks’. They don't mean the annoying task of decorating every variable and procedure in the system with a ‘type specifier’, but rather having the compiler notice that your code said to vector-ref the result of a floating point divide and that most likely is not what you intended. Notwithstanding the contrarians who complain that maybe they did want to vector-ref a float, every Lisp hacker I know of appreciates getting early notification of these kind of errors (provided that false positives are very rare and it remains unnecessary to kowtow to the compiler).

Next anecdote later...

Thursday, April 29, 2010

Persistent store: a new record type

The object map allows us to abstract away the physical addresses where our objects are stored, but only if we can remember where we put it! We need another record type and things will get much more interesting.

A commit-record is a special record that we will write to the durable storage. The commit-record will contain (among other things) the physical address of the root of the object map. We'll require a little more functionality from the durable storage layer. When we first use a previously existing persistent store, we need to ask the durable storage layer to find the most recent valid and complete commit record. It can do this by starting at the newest record and reading backwards through earlier records until it finds one.

Since computers are unreliable, we have to take into account the possibility that the computer could crash in the middle of writing a commit record. When looking for the commit record on restart, we ignore any partially written commit records. When we're writing a commit record to the durable storage, we don't return control until we're sure that the record is durably saved and complete. If the computer crashes at any time before that last byte is written, we will recover using the previous valid commit record, and if it crashes at any time after that last byte is written we'll use this one.

There is an extra burder on the durable store to be able to find the most recent commit record, but we can take a different burden off the store. Instead of insisting that every record be durably written at the time we send it to the store, we can allow most of the records to be buffered until it is more convenient to write. We only insist that the buffers be flushed prior to writing the commit record. This makes a huge difference in the performance of the store.

There is a design decision to be made at this point. When we save durable objects, we can either send the object and the object-map records to the durable store immediattely, or we can defer sending them until we want to write the commit record. The eager strategy is more responsive when we are doing a lot of writes because we can start writing records while we're computing. The lazy strategy will reduce the total amount of traffic (every time we modify the object-map, we need to write new object-map records, if newer entries supersede records that haven't even been written, we can elide the write), but when we eventually do write the commit record, there will be a flurry of activity as we force the pending writes. This will cause a noticable pause at commit time.

There is another interesting design decision. In many cases, it is ok to lose a few seconds of work if the computer crashes. You wouldn't want to lose anything if you were logging financial transactions, but if you were checkpointing a text file, you might consider it ok if two or three of the most recently typed letters didn't get saved. When the computer restarted, you'd re-open the text file and find that what you were typing wasn't completely there, but was mostly there. You can re-enter the last couple of keystrokes. If this is acceptable — and it is for some applications and not for others — then it is not necessary for commit records to be immediately written to durable storage so long as they are written in a reasonably timely manner (within a second or two). The freedom to defer writing commit records like this will also improve performance quite a bit.

Exercise: Add commit records to your durable store and the ability to recover the most recent commit record when opening an existing store.

Tuesday, April 27, 2010

Open letter to the SEC

To Whom it may concern:

In the document entitled

    SECURITIES AND EXCHANGE COMMISSION
    17 CFR Parts 200, 229, 230, 232, 239, 240, 243 and 249
    Release Nos. 33-9117; 34-61858; File No. S7-08-10
    RIN 3235-AK37 
    ASSET-BACKED SECURITIES

a number of changes to Item 601 of Regulation S-K, Rules 101, 201, 202 and 305 of Regulation S-T, a new Rule 314 of Regulation S-T and changes to Form 8-K are proposed. These changes are proposed to accommodate the filing of the ‘waterfall’ computer program. The document requests comments on this proposal.

Is it appropriate for us to require most ABS issuers to file the waterfall computer program?

As mentioned in the proposal, ABS issuers are currently required to submit an informal, ‘English language’ description of the waterfall. A computer program would be far more formal and less likely to be ambiguous or misinterpreted.

Is it appropriate to require issuers to submit the waterfall computer program in a single programming language, such as Python, to give investors the benefit of a standardized process? If so, is Python the best choice or are there other open source programming language alternatives (such as PERL) that would be better suited for these purposes?

The proposal goes to great lengths to justify the choice of Python as the language in which the waterfall program should be expressed. Unfortunately, every one of the justifications is either incorrect or irrelevant.

Python is an open source interpreted programming language. — Python is more accurately described as a family of programming languages with several implementations. The Python Software Foundation supports one particular implementation, but both Microsoft and Nokia offer alternative implementations. The Python Software Foundation provides an interpreter, but there are versions of Python that are fully compiled, byte-code compiled, or JIT compiled.
Open source means that the source code is available to all users.— This is true, but irrelevant. INTERCAL is an open source language, yet it is completely unsuitable for the purposes of writing a waterfall program. On the other hand, the Microsoft implementation of ECMAScript is proprietary, yet ECMAScript is certainly a language to consider for waterfall programs.
An interpreted language is a programming language that requires an interpreter in the target computer for program execution. — Interpretation is a technique for implementing a language. Few languages require an interpreter, and as noted above, compilers for Python exist.
Since Python is an interpreted language that does not need to be compiled prior to running it, executable code would not need to be published on EDGAR. We define executable code in Rule 11 of Regulation S-T [17 CFR 239.11] as instructions to a computer to carry out operations that use features beyond the viewer's, reader's, or Internet browser's native ability to interpret and display HTML, PDF, and static graphic files. Such code may be in binary (machine language) or in script form. Executable code includes disruptive code.— A Python program is most certainly ‘executable’ by the definition supplied.

The proposal appears to be suggesting that safety is an important desideratum. This includes the safety of the EDGAR system as well as the safety of the investors that would presumably use the waterfall program. The proposal concludes that a language that is executed by an interpreter would be inherently safer than a compiled language. This is false. There has been a considerable amount of research into computer safety, and it is clear that safety is very hard to achieve even if it is the highest priority in the design. Provable safety is not, and never has been, a goal of Python.

To conclude, I believe that a formal description of the waterfall algorithms written as a computer program may be quite beneficial to investors. The proposal, however, fails to enumerate the desiderata for the language such a program would be written in. The proposal makes a few erroneous assertions and then puts forward Python as a suggested language. While Python is a popular language, it was never intended to be used as a critical component of such an important part of the U.S. economy.

I suggest that Securities and Exchange Commission, if they intend to pursue the path of requiring a formal waterfall program to be provided, follow a formal process for finding or developing a suitable language. First, the requirements for such a language need to be specified. Second, these requirements should be compared against existing and proposed languages. Finally, one or more of the languages should be adopted and implemented.

Sincerely,
Joe Marshall

More on persistence

I haven't forgotten about persistence, I've just been busy.

When I last blogged about it, I discussed the need for an object-map. The object-map maps logical object ids to the physical location that the durable store placed the object. We'll be putting a couple more kinds of things in the durable store, so at this point it is a good idea to add a record abstraction on top of the durable store. It is easier to show this than to describe it. Here are the actual contents of a test durable store:

;;; A persistent store.  Do not edit.
(object 1 929 <string> 1 "zip")
(leaf 1 38)
(object 1 269 <string> 1 "zero?")
(leaf 1 82)
(branch 1 116 0 38)
(object 1 387 <string> 1 "yield-current-thread")
(leaf 1 148)
(branch 1 0 197 82)
(leaf 1 38)
(leaf 1 82)
(branch 1 242 230 148)
(object 1 535 <string> 1 "xsubstring-move!")
(leaf 1 277)
(branch 1 322 0 38)
(branch 1 242 335 148)
(object 1 295 <string> 1 "xsubstring-find-previous-char-in-set")

A ‘record’ is simply a list. The first element is one of object, leaf, or branch. The second element is the version number of the record layout in case I decide to change how the record is represented. The remaining elements are the contents of the record. For a leaf, it is simply the address of the object that this leaf represents. For a branch, it is the addresses of the left and right child and the address of the object the branch represents. (This is using a functional binary tree as detailed in the code.) An object record has four additional components. First is the object-id. Second is the data tag of the object. Third is a version number of the serialization method for the object and finally the serialized representation of the object itself.

This is clearly not the most parsimonious representation, and it could be made much more robust by including record length information, checksums or CRCs, and by ensuring that user data (such as those strings) cannot spoof a record boundary, but this is demonstration code, not production code.

In the next installment, we add another record type.

Friday, April 23, 2010

The Education of JRM

I learn so much from the Internet. Yesterday, someone posted a link to my post about Java. I got close to 3 orders of magnitude increase in traffic. Many people commiserated with me, and I thank them. Some, however, went the extra mile:

An ellipsis has three dots.

Mea culpa. I was sloppy. Fortunately, another reader defended me:

An ellipsis at the end of a sentence is three dots, followed by a period. He/she did it right.

Alas, if I did it right it was only by accident.

There should be a space between the ellipsis and the period. Putting four dots ala George Lucas is incorrect....

I'm flattered to be placed in such good company!

One reader pointed out that the problem might not lie in Java, but within me:

This is what reflection is for. If you can't solve this problem with reflection, then you don't really have reflection.

while others offered concrete advice:

Maybe you just need to learn to type faster.

and still others tactfully suggested I broaden my horizons:

ever code in C++? this isn't unique to Java, noob.

I thank you all for the advice and the spirit in which you offered it. I believe I'll reflect upon it this weekend — assuming I really have reflection. I will continue posting in the hope that I can be further edified.

Wednesday, April 21, 2010

Whenever I write code in Java....

Whenever I write code in Java I feel like I'm filling out endless forms in triplicate.

“Ok, sir, I'll just need your type signature here, here, and ... here. Now will this be everything, or...”

‘Well, I might need to raise an exception.‘

The compiler purses its lips.“An exception? Hmmm... let's see.... Yes, I think we can do that... I have the form over here... Yes, here it is. Now I need you to list all the exceptions you expect to raise here. Oh, wait, you have other classes? We'll have to file an amendment to them. Just put the type signature here, here, ... yes, copy that list of exceptions....

Student M overhears the argument between Students A and T. “What seems to be the problem?”

Student T explains, ‘We're at an impasse. I want to be able to change the semantics of the language...’

Student A says “... and I want the semantics to be stable.”

Student T says ‘It seems to me that if the program is a tool for solving a problem, then the language itself is just another tool. I should be able to make changes to the language if it helps solve the problem.’

Student A replies “And I have no problem with that in principle, but when I'm writing a program, I need to know what the language semantics are so I can have some idea about whether my program will work. If the semantics change after I write the program, I won't even know whether I have a program. Mangle the semantics all you want, but then tell me what they are and don't change them!”

Student T says ‘I'm not going to “mangle” the semantics. But I don't want to lock myself out of a solution for your convenience. I want to be able to take whatever baseline semantics we have and tweak them as appropriate, but I can't change the semantics before I write the program the changes them!’

Student A says “This is just impossible.”

Student M interrupts “Time out, everyone. Ok, now let me get this straight. You...” and he points to student A, “want the meaning of your code to be stable, and you...” (he points to student T) “ want to be able to tailor the semantics to suit you. That's simple! Why are you arguing?”

Students A and T look confused. ‘Simple?!’

Student M says, “Of course. You both agree that changing the language will change the semantics, and conversely changing the semantics changes the language...”

Students A and T nod.

Student M turns to student A “So you write your code in your language (let's call it L), and we'll be sure that L has stable, static semantics that never change.”

Student A says “That's what I'm talking about.”

Student T objects ‘Hold on! You're just completely ignoring me! What if I don't like the semantics?’

Student M turns to student T “I'm not done, yet. You want the freedom to morph language L into language L', which is very much like L, but with maybe a few changes...”

Student T interrupts ‘.. or a lot of changes. And not just L', I want L, L', L'', L''', whatever changes I think are necessary and whenever I think they are necessary.’

Student M continues “... or a lot of changes. I get you. Now here's the question: If I have a program in language L, but I have a machine that runs a different language — call it C — what do I do?”

Student T replies ‘Write an interpreter...’

Student A answers “ ... or a compiler.”

Student M says “Bingo! Problem solved. Simply use language L as the base language for an interpreter or compiler for programs in L' (or L'' or L''', etc.)”

Student A thinks for a moment then says “Works for me. If I need to understand code written in L', I can just compile it down to L.”

Student T looks dubious. ‘Wait a second. You're suggesting I write a compiler? What if I want several different language features to mix and match? I'm supposed to write compilers for each of them?!’

Student M shrugs “Well, you're the one that wants to change the language, you can't expect student A to write them. Besides, it isn't that much work.”

Student T protests ‘Not that much work? It's a compiler! Flow control, analysis, register allocation, code generation! I just want to tweak the language, not invent a whole new one from scratch!’

Student M counters “Who said you need to do all that? A compiler is simply a program that translates code from one language to another, right? If you can express the constructs of language L' (or L'' etc.) in terms of language L, then that's all you need to do. If you're really just tweaking the language, then your ‘compiler’ is mostly the identity function.“

‘And the parts that aren't?’

“Macros.”

Is student M's solution reasonable?

Thursday, April 15, 2010

Another wacky idea

Student T exclaims “I have a great idea!

“Sometimes I want to make a computer language that is very similar to an existing language with just a couple of exceptions. Wouldn't it be cool if we could just tweak the semantics of the existing language?”

Student A asks ‘What do you mean? What part of the semantics?‘

“Oh, I don't know... Anything!”

‘Anything? Anything at all?’

“Sure. Why limit ourselves?”

‘Like, say, adding or removing special forms?’

“Of course. Anything!”

‘How about changing from call-by-value to call-by-name?’

“That would be harder, but why not? Imagine you're running a program and you realize you need lazy evaluation for some part of it. Well, you just turn on lazy evaluation for that segment of code and turn it off again when it returns! Voila!”

‘What? While the code is running?’ Student A thinks for a moment and says, ‘So in essence you want the language semantics to be a function of time.’

Student T replies “No no no. I just want to be able to change them on the fly if I want.”

Student A says ’Yes, that's what I mean. At different times you could have different semantics.’

Student T says “Yes, but only if you change them.”

‘And how do I change them?’

“You just call a special primitive or something from your program.“

‘So if the language semantics can change over time, doesn't that imply that the meaning of a program can change over time as well?’

Is Student A right?

Wednesday, April 14, 2010

Let me propose these thoughts: a “semantics” is a way of associating meaning with programs. A “formal semantics” uses well-known mathematical models in an attempt to be thorough, precise, and consistent. An “informal semantics” uses human language, analogies, and examples in an attempt to be more concise and easier to understand.

As a programmer, you need to have some idea of semantics in order to get anything done. Even something as simple as the expression x + 1 presupposes that the symbol + has something to do with addition, the symbol 1 names a literal number, and the symbol x names something that can be understood numerically. Your understanding doesn't have to be complete (how does that supposed addition handle overflow?), and it doesn't have to be correct (maybe x is a string and + performs concatenation), or even consistent, but you can't work with no understanding whatsoever.

Does that seem like a reasonable point of view?

Tuesday, April 13, 2010

It's clear that you don't need to fully understand the formal semantics of a language in order to program in it. (There are a lot of languages for which the formal semantics haven't been specified.) How far could you get without knowing the informal semantics?

Monday, April 12, 2010

Student Z has another complaint

Student Z raises his hand in class. He says “Ok, so I guess I see a reason to do something with code other than run it. But I still think there's too much academic wankery in this field. Look at all these discussions of meaning and semantics. Who cares? Your average hacker doesn't bother with semantics, he just wants to get the job done.”

Well, obviously I'm setting up Student Z as a straw man, but how important are the semantics of a language? Are they simply a curiosity for ‘language lawyers’ and mathematicians to worry about, or are they a tool that is used daily by programmers?

Saturday, April 10, 2010

Improvements

Of course the next day after you upload something you find bugs. If you go to http://code.google.com/p/jrm-code-project/source/browse/#svn/trunk/PStore you will find not only updated versions of the object-map and persistent tree code, but also a very primitive serializer and deserializer (using John Cowan's suggesting of read and write).

Friday, April 9, 2010

Some code

If you're following along with the persistent storage thread, you may want to take a look at http://code.google.com/p/jrm-code-project/source/browse/#svn/trunk/PStore. You will find two Scheme files. object-map.scm implements an object-map based on a persistent binary tree that is in persistent-tree.scm. You will have to provide the serialize and deserialize procedures yourself.

Thursday, April 8, 2010

At this point, we can save and restore primitive objects, but restoring is somewhat painful because we have to remember the address that the durable storage layer returned to us. We need a way to associate an identifier of our own choosing with the durable address. We'll define a structure called the object-map that is basically a big lookup table that maps logical object ids to the addresses that the durable storage layer uses.

The object-map will be used a lot, so it is important to come up with an implementation that has decent performance. The object-map will also have to be stored durably, so it had better not get unwieldy when it is full of objects. The choice of implementation affects the performance of the store as a whole, so there are dozens of variations on several different themes. I'll discuss just a couple.

Perhaps the simplest and most obvious implementation would be a simple vector of durable addresses. Object N's address would be stored in the Nth entry in the vector. Naturally, this is very fast. Unfortunately, every time we make a logical change to the table, we have to write the entire table back out to the durable storage. (Our durable storage layer has no mechanism for `update in place'. This is by design, but the advantages are not apparent yet. One particular disadvantage is quite obvious now.) If object IDs are permitted to be removed from the table, the table could become sparse. In that case, a simple vector would be a tremendous waste of time.

Another possible implementation would be to represent the object-map as an association list. This would greatly improve the performance of persistent object allocation because only the new association list entry would have to be written to durable storage. Lookup now becomes linear, however. Removing an object from the association list would require writing back every entry between the object and the most recently allocated object.

Another possibility is using a vector as before, but rather than saving the entire vector just save the changes to the table. This technique degenerates into the association list case if the table is never written back.

A popular implementation technique is to use a tree of some sort. A binary tree has the advantage of logarithmic lookup time and, if done cleverly, log time and space insertion and deletion. A tree with a larger branching factor would be faster on lookup, although it would require larger update writes. If the durable storage is structured in physical blocks, it can greatly improve performance if a tree-structured object map uses nodes that correspond to the physical block size.

Regardless of the underlying implementation technique, the API is simple:


object-map/add object-map object-id object
  => new object map

Returns a new object-map that contains all the mappings of the
argument object-map in addition to a new mapping from
object-id to object.  If object-id had a
previous mapping, that mapping is not accessible in the result. 

Object will be serialized into the durable store associated with the
object-map.  The object-map itself will also be serialized into the
store.


object-map/address object-map
  => a durable address

Returns the durable address at which the object-map was serialized.


object-map/create durable-store
  => new object map with no entries

Argument durable-store is forever associated with this object
map.


object-map/durable-store object-map
  => a durable store

Returns the durable-store is associated with this object map.


object-map/lookup object-map object-id
  => an object previously added
     Error if object-id was never added.

Returns the deserialized object associated with object-id.
The object is deserialized from the durable-store associated with the
object-map.


object-map/recover durable-store address
  => object-map or
     Error if address does not conain an object map.

Deserialize an object-map from a durable store at the given address.


;; Optional API items.
object-map/lookup-address object-map object-id
  => the durable address of an object previously added
     Error if object-id was never added.

Returns the durable-address of the serialized object associated with
object-id.  This function will no doubt exist in any
implementation, but the question is whether to export it.


object-map/remove object-map object-id
  => new object map

Creates a new object map that is like the argument
object-map, but does not contain a mapping for
object-id.

Exercise 3: Implement the object-map API on top of the durable storage and serialization APIs.

As an aside. This random excursion into persistent storage has nothing to do with the “Professor and Students A, B, and C” thread. I'm trying to think of the right was to continue that thread and started this unrelated thread as an amusement.

Wednesday, April 7, 2010

Serialization

I was going to write up something on serialization, but when I started researching it I ended up falling down a rabbit hole. Nearly everything that could be said (whether right or wrong) has been. I'll just touch on a couple of points.

By abstracting out the durability layer, we can ignore (for the moment) the issues of data integrity and compression. We'll assume that the durability layer will provide a suitable level of assurance that the data returned by a call to read-bytes is exactly the same as the data previously written. Furthermore, we'll assume that the durability layer will compress the data if it has a need to.

For our purposes, we only need to serialize data in order to turn it into bytes that can be durably stored. There is one thing to keep in mind, though. We don't know how long we may end up keeping this data. We don't want future readers of the data to have to make too many assumptions about what it is we serialized. We should explicitly tag every datum we store with meta-data that tells us what it is.

Primitive data doesn't need much in the way of metadata. There should be a tag indicating the primitive type and some indication of the version of the serialization format. You might even dispense with the version tag (how many different ways will you be storing integers, anyway?). If you choose a reasonably standard serialization format, it is much more likely that you'll be able to get at your data in the future even if you lose all the information about how it is encoded.

Exercise 2: Pick a binary encoding format and implement an encoder and decoder for these Lisp/Scheme data types: integer (union of fixnum and bignum), character (Unicode code point), float (IEEE 754 double at the very least), UUID (RFC 4122), boolean, rational, complex, string (use UTF-8 unless you have a good reason not to). The decoder should be able to take any sequence of bytes generated by the encoder and return an object equivalent to the input (for an appropriate selection of equivalent). Don't do symbols or cons-cells yet.

For ChangeSafe, I used a single byte tag to encode the primitive serialized data. The tag was chosen by finding a mnemonic ascii code for the data:

(defconstant serialization-code/fixnum    (char-code #\#))
(defconstant serialization-code/keyword   (char-code #\:))
(defconstant serialization-code/string    (char-code #\"))

This allowed me to read serialized data into Emacs and have a few visual hints as to what it meant.

Tip: When you set up an enumeration like this, it is a good idea to define code 0 and code 255 explicitly as illegal values. This will catch a whole slew of accidentally uninitialized data.

Monday, April 5, 2010

I spent a lot of time over the weekend thinking about persistent data. I was trying to come up with a set of base abstractions over which to build a persistent store. I haven't found the best set, but I found part of a good set that can be refined to be better.

You need two things in order to effectively use persistent data. The first is a ‘durable store’ which we will use to remember the information, the second is a management system that gives us control over what is stored and retrieved and acts as an intermediary between transient programs and the persistent data they need. The basic interface between the durable store and the management system is this:

(write-bytes! store-name byte-vector) => address

(read-bytes store-name address) => byte-vector

write-bytes! is the primitive means of saving information. The store-name argument identifies which particular durable store save the bytes in, and the byte-vector argument is a vector of small numbers for the store to save. When invoked, the store should save the numbers somehow (more on this later), and return an address, which is perhaps simply an integer.

read-bytes is the primitive means of recovering information. Again, the store-name argument identifies which particular durable store to read the bytes from, and the address is a value that was previously returned by a call to write-bytes! on that same store. It should return a byte vector whose contents are equal to the one that was stored.

We'll make a very simple requirement for the behavior of this API. write-bytes! may either succeed or fail. If it fails, we require it to not return a success indication. For this API, we assume that if write-bytes! returns an address, that it succeeded. Furthermore, read-bytes may also fail or succeed, but if it succeeds it must return the same bytes that were given in the call to write-bytes! that produced that address.

We'll need to extend this API a bit, but it is an ok start.

This API gives us a mechanism for durability, but it isn't a very useful one. What if you want to save something other than byte vectors? How do you remember the address that you got back? This will be the responsibility of the management layer.

Exercise 1: Implement the above API using the file system.

Friday, April 2, 2010

In that last post Student Z made the ridiculous comment that he didn't want to analyze code, he just wanted to run it. I didn't make this up. In an argument I was having a few years back the other person wrote

This is the overriding concern of the FP advocate. They want to reason about their code. This is a strange proposition to the vast majority of industry. They don't want to reason about it, they want it to be easy to implement and run fast!

... I've always come at these problems from the industrial standpoint, not the academic theoretical I-want-to-reason-about-it standpoint.

Well, you do find trolls on the internet, but I was surprised to see a professor say:

I would cheerfully give up [the ability to off-line analyze a file] for the sake of [a top-level REPL and a model where loading a file is just like “typing it in”].

Analysis of code doesn't just mean ‘automated theorem proving’. It also means being able to casually look over the code and get a clue as to what it is doing. Just looking at what identifiers are defined in a program and which are free can give a lot of the context. But as you can see, you cannot just slavishly evaluate each form in turn in order to understand the entire program.

As to Student B's question, “How you can tell what you need to evaluate and what you don't?” I think I was too vague in presenting the question or too caught up in anthropomorphization. We need to step back from what the interpreter does with the code (evaluation) and abstract from it (abstract evaluation). Student B, of course, wouldn't be expected to understand this, but the professor should, and he ought to be able to frame the issue in a way that Student B can at least follow.

For the specific example of this quiz, you can simply say ‘ignore any top-level form that isn't a definition’. This is usually a pretty good rule of thumb, and I'll start there, but we need to be a lot more precise. (Since the questions, if poorly stated, at least have agreed upon answers, we ought to be able to precisely specify why our answers are correct.)

Thursday, April 1, 2010

Back in the classroom Student Z complains “You're talking about analyzing code. I don't want to analyze code. That's ivory tower crap. In the real world, they write code and run it. Why should I bother learning this stuff?”

Is this a reasonable viewpoint?

Student B has a more technical question. “So I think I'm getting this idea that you want to be able to understand the code without simply evaluating it, but what I don't get is how you can tell what you need to evaluate and what you don't. If you don't evaluate anything, the how could you know that (define foo (lambda (x) ...)) is going to define a name as a procedure? You sort of have to do some evaluation. But then how do you know that you don't have to evaluate the ‘sanity check’ code?
“I guess what I'm asking is: are there some rules I should know, or is this just seat-of-your-pants stuff?”

In other news, I'm thinking about how to do a simple persistent store in Scheme/Lisp. I do a fair amount of database type stuff at work and I'm pretty sure there a lot of improvements on what I'm seeing. It seems to me that the second year course in computer science should deal with persistence and databases, and that a “Structure and Interpretation of Persistent Data” or a “How to Design Databases” would be a great book.

So in order to play with these ideas, I'm putting some work into MIT Scheme. The first thing I wanted was ‘:keyword’ objects — self-evaluating symbolic constants. I prefer the leading colon style, but I know others prefer the trailing. SRFI-88 specifies trailing, but many versions of Scheme have leading (or both). I decided to have a way to switch between the two styles.

CPH suggested that switching should be on a per-file basis, like case-sensitivity. Neither one of us like the #!fold-case #!no-fold-case flags that R6RS suggests. (Read Clinger's rant). After a bit of casting around for ideas, the one that made the most sense was to put the case folding and keyword style in the ‘file attributes line’ at the top of the file.

;;; -*- Mode: Scheme; keyword-style: prefix -*-

That was slightly painful to implement because there is a large variability in the syntax of the file attributes line and I wanted to provide a ‘soft landing’ for strangely written lines.

The upshot is that now I can use keywords (in prefix style!) and correctly interact with code that is written with trailing-colon keywords or with no keywords at all. (Although the latter will find it fairly nasty to invoke to my code: (foo (string->keyword "a") 22).)