Tuesday, January 7, 2020

Remotely like a procedure call

The moment you connect two computers together with a wire, you introduce a number of failure modes that are very hard to deal with. Undeliverable messages, fragmentation of messages, duplication of messages, lost messages, out-of-order arrival, and untimely arrival come to mind; there are probably some I am forgetting. With a bit of engineering, you can eliminate some of these failure modes, or reduce the likelihood of them occurring, or trade the likelihood of one kind for another, but ultimately, when you send a network message, you cannot be sure it will reach its destination in a reasonable amount of time, if at all.

Usually, in a network, the computer that initiates the interaction is called the client, while the computer that performs the action is called the server. There are usually several computers along the way (routers, proxies, and switches) that facilitate the interaction, but their job is to act as a wire in the non-erroneous case, to return responses if needed, and to report errors back to the client if necessary and possible.

There are several taxonomies of network errors, but one useful way I have found to categorize them is by what sort of actions you might want to take to deal with them. If you are lucky, you don't have to deal with them at all. For instance, it might be reasonable to simply lose the occasional remote info logging message. Or if a web page fails to be delivered, it might be reasonable to rely on the recipient eventually hitting the reload button. This isn't much of a strategy, but it is easy to implement.

Many times, it is possible for the client to detect that the interaction failed. Perhaps the server returns an error code indicating the service is temporarily unavailable, or perhaps a router or switch along the way indicates that it cannot deliver a packet. This isn't the desired state of affairs, but at least the client can come to the conclusion that the interaction didn't take place. Depending on what is causing the failure, there are essentially only two options for proceeding: trying again and giving up. If the error is persistent or long-term, trying again is futile and your only option is to defer the operation indefinitely or abandon it altogether. A robust, distributed system has to be prepared to abandon interactions at any time and have a strategy for what to do to recover from abandoned interactions. (For example, it could pop up a message box indicating a service is not available, or it could send an email saying that it will process the interaction at some undetermined time in the future. I suppose crashing is a strategy, but it stretches the definition of “robust”.)

Trying again, usually after a short wait, is often an appropriate strategy for dealing with transient errors. For instance, an unreachable server may become reachable again after a routing change, or a service could get restarted and start serving again. It may not be possible to tell the difference between a transient error and a persistent one until you have retried several times and have had no success. In this case you need to fall back to the persistent error strategy of deferring the operation indefinitely or abandoning it altogether. If retrying is part of the strategy for handling transient errors, it becomes necessary for the server to deal with the possibility of handling the same message multiple times. It should, through the use of timestamps, versioning, or other idempotency mechanisms, be able to ignore duplicate requests silently. Otherwise, the original error is handled through a retry only to cause the server to raise a duplicate-request error.
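
Sketched in Python, the client side of that strategy might look something like this. Everything here is a placeholder for illustration: the send_request callable, the exception classes, and the particular retry limits and backoff schedule are assumptions, not any particular library's API.

import time

class TransientError(Exception):
    """The interaction failed, but a retry might succeed."""

class PersistentError(Exception):
    """The interaction failed and retrying is futile."""

def call_with_retry(send_request, max_attempts=5, base_delay=0.5):
    """Retry a request a bounded number of times, backing off between
    attempts.  If every attempt fails, reclassify the error as
    persistent so the caller can defer or abandon the interaction."""
    for attempt in range(max_attempts):
        try:
            return send_request()
        except TransientError:
            # Wait a little longer after each failure before retrying.
            time.sleep(base_delay * (2 ** attempt))
    # Out of retries: fall back to the persistent-error strategy.
    raise PersistentError("giving up after %d attempts" % max_attempts)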

An even less desirable failure mode is when a message is lost or cannot be delivered, and this cannot be determined by the client. The message gets sent, but no reply is received. You are now left guessing whether it was the original message that was lost or just the reply. The usual strategy is to eventually time out and resend the message (essentially, you inject a transient failure after a pre-determined amount of time and then handle it like any other transient failure). Again, the server has to be designed to see the same message multiple times and to handle duplicates silently and gracefully. Also again, the situation may become persistent and the persistent fallback strategy has to be used.
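
One common shape for that server-side discipline is to have the client tag each logical request with a unique ID and have the server cache replies keyed on it, so a duplicate is answered with the original reply instead of being processed twice. A minimal sketch, assuming a client-chosen request ID and an in-memory cache (a real server would have to bound or persist that cache):

class IdempotentServer:
    """Answer a retried (duplicate) request silently with the cached
    reply instead of processing it a second time."""

    def __init__(self, handler):
        self.handler = handler   # performs the actual work
        self.completed = {}      # request_id -> cached reply

    def handle(self, request_id, payload):
        if request_id in self.completed:
            # The client retried because the reply (or the original
            # request) was lost; resend the reply we already computed.
            return self.completed[request_id]
        reply = self.handler(payload)
        self.completed[request_id] = reply
        return reply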

No one size fits all, so network systems come with libraries for implementing various strategies for error recovery. A robust, distributed system will have been designed with certain failure modes in mind and have an engineered solution for running despite these errors occurring once in a while. These solutions involve designing APIs that can handle stale data through version timestamps and a reconciliation process, idempotent APIs, temporarily saving state locally, designating “ultimate sources of truth” for information, and designing protocols that reach “eventual consistency” given enough time. There's no panacea, though: no programming technique or trick simply results in a robust, distributed system. A robust, distributed system has to be designed to be robust despite being distributed, and you need some experienced people involved in the design and architecture of such a system at the beginning so that it will be robust once it is implemented. There is actual non-trivial work involved.
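
As one small illustration of the versioning idea, a store can attach a version number to each record and silently ignore updates that arrive stale or duplicated, which lets replicas converge when messages arrive late, twice, or out of order. A sketch; the last-writer-wins policy shown here is just one possible reconciliation choice:

class VersionedStore:
    """Tag each record with a version so stale or duplicated updates
    can be detected and ignored when they arrive late or out of order."""

    def __init__(self):
        self.records = {}   # key -> (version, value)

    def put(self, key, version, value):
        current = self.records.get(key)
        if current is not None and current[0] >= version:
            return False    # stale or duplicate update: ignore it
        self.records[key] = (version, value)
        return True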

(It amuses me to see that some people understand part of the problem without understanding much about the underlying tools. I've seen packet timeout and retry mechanisms bolted onto TCP streams, which already have this built into their abstraction.)



Enter the remote procedure call. If nothing goes wrong, a network interaction that involves a return receipt bears a remote resemblance to a procedure call. The syntax can be made similar to a procedure call; it's just that the call begins over here, the computation happens over there, and the return value is sent back over here. The major difference is that it is significantly slower. (And just forget about tail recursion: although there's nothing in theory that would prevent a properly tail-recursive RPC, I've never seen one.)

RPCs are a weak, leaky abstraction. If everything goes well, the network interaction does indeed appear as if it were a (slow) procedure call, but if there are problems in the network interaction, they aren't hidden from the caller. You can encounter persistent errors, in which case the caller has to be prepared to either defer the call indefinitely or abandon it altogether. You can encounter transient errors, which suggests attempting to retry the call, but it may not be clear whether an error is transient or persistent until you've retried and failed several times. If retrying is part of the error recovery strategy, then the callee has to be prepared to receive duplicate requests and discard them or handle them in some other (presumably idempotent) manner. In short, none of the errors that arise from network interactions are abstracted away by an RPC. Furthermore, network errors are often turned into RPC exceptions. This seems natural, but it makes it difficult to separate the strategies for handling network errors from the strategies for handling other exceptions.
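
One small mitigation is to keep network failures under a different exception type than errors deliberately raised by the remote procedure, so the retry/defer/abandon machinery doesn't get tangled up with ordinary error handling. A sketch; the class names and the stub are hypothetical:

class RpcNetworkError(Exception):
    """The interaction itself failed: retry, defer, or abandon."""

class RpcApplicationError(Exception):
    """The call was delivered and the callee signaled an error:
    retrying will not help; handle it as ordinary program logic."""

def lookup_account(stub, account_id):
    try:
        return stub.lookup(account_id)
    except RpcNetworkError:
        # A failure of the interaction: hand it to the retry machinery.
        raise
    except RpcApplicationError:
        # An ordinary error from the callee, e.g. no such account.
        return None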

A featureful RPC mechanism can aid in figuring out what happened in an error case, and can provide various recovery mechanisms, but it is still up to the programmer to determine what the appropriate recovery strategy is for each error condition.

RPCs are by their nature a late-bound mechanism. There has to be machinery in place to resolve the server that will handle the network interaction, route messages, and possibly proxy objects. All too frequently, this machinery simply doesn't work because sysadmins thoughtlessly apply overly broad firewall rules that prevent the appropriate name resolution and proxying from occurring. To get around this problem, some systems have resorted to using HTTP or HTTPS as the underlying RPC mechanism. These protocols are rarely blocked by firewalls, as blocking them would render it impossible for sysadmins to watch videos.

HTTP and HTTPS were designed as protocols for transferring web pages, not for general remote procedure calls. For example, errors are handled by returning a 3-digit status code and a small blurb of text. Shoehorning a featureful RPC mechanism into web interactions is difficult, so we are left with shuffling JSON objects over HTTP connections as our poor-man's approximation to a well-designed RPC mechanism.
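
That poor-man's approximation is at least easy to write down. A sketch using only the Python standard library; the URL and the method/params envelope are made up for the example:

import json
import urllib.request

def json_over_http_call(url, method, params):
    """Encode a call as a JSON object, POST it over HTTP, and decode
    the JSON reply.  Network and HTTP failures surface as urllib
    exceptions, left to the caller's retry/defer/abandon strategy."""
    body = json.dumps({"method": method, "params": params}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))

# Hypothetical usage:
#   result = json_over_http_call("https://example.com/rpc", "add", [1, 2])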

RPCs can be a useful, if limited, abstraction. If you are using them, you need to be careful because they don't abstract away the problems of distributed systems, and it becomes all too easy to build a fragile distributed system rather than a robust one.


Addendum: It seems that I'm not the only one having problems commenting on things. Arthur Gleckler tried to write a comment but it simply got eaten and returned him to the posting page. I don't know what John Cowan is doing right, but he appears to be able to post comments just fine. Google has this unfortunate habit of letting bit rot set into their products. Then, once enough of it stops working, they kill the product or try to replace it with something no one wants. Anyway, here's what Arthur had to say:
Speaking of tail-recursive RPCs, when Google was redesigning its RPC system, I tried to get the new designers to implement a simple tail-recursive capability so that client A could call server B and receive a response from server C. That could have made a lot of common patterns at Google faster, since it was common for A to call B, which called C, which called D, and so on, e.g. to try different mechanisms for handling a particular search query. Unfortunately, I wasn't persuasive. Frankly, it's hard for people who haven't spent much time with tail-recursive languages to see the benefit, and these were hard-core C++ hackers.

1 comment:

John Cowan said...

That's why it's much better to use remote pipelines instead of RPCs. They can still fail after being set up, but most of the time the best recovery mode is to crash and restart the whole pipeline, as one does on the command line.

I once provided (local) pipes in an OS that had no notion of them by writing a multi-threaded server that matched senders with receivers using its own internal buffers. Fortunately any server could register itself in the file system by the analogue of /servers/<name>, so mine was /servers/pipes/n for some integer n.

Of course, then I had to write a shell that started processes with standard input and/or output set to point to some port on the pipe server, which read all incoming packets labeled with their port. Then I had to port the K & P Software Tools, as there was a Fortran compiler at least. Text files were an issue too: the editor produced mini-databases that only the editor itself knew how to write to, though there was a library for reading them. So the Tools wrote ordinary sequential files and read either sequential or edit files indifferently.

Ah, life in the bad old days.