Wednesday, October 14, 2009

Short exercise

I seem to have solved the bug in my flat environment code. I'll post on that in a day or two.

An interesting problem came up at work a couple of days ago, and it makes for a good engineering problem. Let me see if I can describe this in an interesting way.

A database application generates charts that describe certain things about the data. For instance, you might generate a chart of a moving average of field X for all entries where the username is ‘jmarshall’. This is pretty standard stuff for a simple database. There's a problem, however. For various reasons, some of the information we like to plot is stored elsewhere and we have to fetch it. This is a heavyweight request that takes 150ms. We have on the order of 10K records, and we typically examine several hundred to create a chart. If we need the remote data, we're looking at several hundred of these heavyweight requests. A big chart can take two and a half minutes to draw and the user gets very impatient.

A cache was added in order to make things remotely bearable. The database changes relatively slowly (a handful of records a day), and the charts are not designed for pinpoint accuracy, so there is no requirement that the cache be completely fresh or for it to provide a transaction consistent view of the data. A very simple cache that remembers the data for a few hours is acceptable. This alleviated a lot of the pain because now users could generate different variations of their charts and compare them in real time.

But there is still the problem with the ‘startup transient’. If no one has recently generated a chart, that first one you make will take forever as the cache is loaded. So the problem is getting rid of the startup transient.

I'll give the following hints:
  • Assume memory is not an issue.
  • Assume multiple writers to the database and the remote data. You will not get ‘update’ notifications.
  • Several instances of the application may be running at once, but they cannot communicate with each other. (So an application must not saturate the database with requests.)
  • Fetching part of a record or the additional info is no cheaper than fetching the whole thing, so a timestamp or checksum comparison will not be faster.
  • The data does not have to be transactionally consistent.
  • The data can be somewhat stale, but there should be a limit.
The problem is to come up with a cache refresh policy that avoids the startup transient and estimate how stale the data could become under that policy.