Last night I had an ‘Aha!’ moment. We can look at the distribution as the probability that a result will be ready when we poll (that's not the ‘Aha!’). But we can pretend that we made all the RPCs at exactly the same time. If we had, then this chart shows the fraction of RPCs that finish as a function of time (not a big ‘Aha!’, but I'll take it). So if we were to poll at 2 seconds, we'd find about 20% of the RPCs have completed, but if we wait until about 4.5 seconds we'd find 90% have completed.
What if we polled at 2 seconds and 4.5 seconds? Well, the 20% that completed before the poll at 2 seconds will be serviced right after that poll, but any that become ready between 2 and 4.5 seconds will have to wait until the later poll. That will be
90 - 20 = 70%. That is to say, the efficacy of the poll — the marginal increase in likelihood of getting a response — is proportional to the vertical difference in the graph divided by the time since the last poll. Or, to put it yet another way, the polling frequency should be proportional to the derivative of the distribution. Aha!
An empirical measurement of the derivative would have a lot of noise. Fortunately, we noticed that the cumulative distribution is closely modeled by a cumulative log-normal curve. Cumulative means integral. The derivative is therefore simply a log-normal curve centered at the median time (about 3 seconds). (Is this an ‘Aha!’, or a ‘D'oh!’ moment?)
I think I have enough of an understanding of this problem to write some code.