Wednesday, July 18, 2007

Well, I wrote some clustering code, but it isn't staggering into life yet. I've been distracted by a new problem at work: making sense out of tens of thousands of data points.

We have a number of computers that the QA group uses. I had the idea that we should probe them and collect some statistics. With enough statistics, maybe we'll see some trends or something. I wrote a small script that goes to each QA machine, pings it, and if it is alive, logs in and collects a bunch of basic info like physical memory in use, swap space in use, load average, number of threads, etc. The results are stored in a file with the machine name and a timestamp for when the sample was taken.

Over the past couple of weeks I've collected over 100 thousand samples of about 13 different values in each sample. (Parts of the sample are pretty static: the total size of physical memory doesn't change, and the operating system is only changed rarely, but these are interesting values to have at hand.)

The problem now is to extract some information from this pile of raw data. This isn't as easy or obvious as I had thought it would be. My first experiment was to examine the load averages. I figured that maybe we could tell the difference between machines that were in use and those that were idle, or between correctly running machines and broken ones. Maybe we'd see a correlation between the amount of RAM and the load. Little did I know.... (more later)
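
For the curious, here is a rough Python sketch of the kind of probe described above. It is not the actual script: the function names, the tab-separated record layout, the field list, and the choice of /proc/loadavg and /proc/meminfo as data sources are all assumptions made for illustration, and it presumes Linux machines with key-based ssh logins.

```python
import subprocess
import time

# Hypothetical field order for each stored record (not the original 13 values).
FIELDS = ("load1", "load5", "load15", "threads",
          "mem_total_kb", "mem_used_kb", "swap_used_kb")


def is_alive(host):
    """Return True if the host answers a single ping (Linux ping flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def fetch_raw(host):
    """Log in over ssh and dump /proc/loadavg followed by /proc/meminfo."""
    try:
        result = subprocess.run(
            ["ssh", host, "cat /proc/loadavg /proc/meminfo"],
            capture_output=True, text=True, timeout=30,
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout if result.returncode == 0 else None


def parse_sample(raw):
    """Turn the raw /proc text into a dict of numbers."""
    lines = raw.splitlines()
    # First line looks like: "0.20 0.18 0.12 1/80 11206"
    head = lines[0].split()
    load1, load5, load15 = (float(x) for x in head[:3])
    threads = int(head[3].split("/")[1])  # total scheduling entities
    # Remaining lines look like: "MemTotal:  8167848 kB"
    mem = {}
    for line in lines[1:]:
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            mem[key.strip()] = int(fields[0])
    return {
        "load1": load1,
        "load5": load5,
        "load15": load15,
        "threads": threads,
        "mem_total_kb": mem["MemTotal"],
        # Crude "in use" figure: this counts buffers and cache as used.
        "mem_used_kb": mem["MemTotal"] - mem["MemFree"],
        "swap_used_kb": mem["SwapTotal"] - mem["SwapFree"],
    }


def format_record(host, timestamp, sample):
    """One tab-separated line: machine name, timestamp, then the values."""
    values = [str(sample[name]) for name in FIELDS]
    return "\t".join([host, str(int(timestamp))] + values) + "\n"


def probe_all(hosts, path):
    """Append one record per reachable host to the file at path."""
    with open(path, "a") as out:
        for host in hosts:
            if not is_alive(host):
                continue
            raw = fetch_raw(host)
            if raw is None:
                continue
            out.write(format_record(host, time.time(), parse_sample(raw)))
```

Calling something like probe_all(["qa01", "qa02"], "samples.tsv") from cron every few minutes (both the host names and the file name are made up) would pile up samples in roughly the shape described: one line per machine per probe, tagged with the machine name and the time the sample was taken.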