This is the response time of the Google Mini search appliance on several thousand queries. There is an odd spike at 300 ms. A number of machines were taking exactly 300 ms to respond regardless of the query.
I soon tracked this down to the spell server. This is the microservice that puts “Did you mean foobar?” on the top of the page when you spell something wrong. The results page generator will wait up to 300 ms for the spell server to come up with its suggestions and then it will proceed without it. What appeared to be happening is that the spell server was giving a “no go” signal to the page generator at the beginning of page composition. The results page generator would then wait for the spell server to make a suggestion. The spell server would invariably take too long, so after 300 ms, the results page generator would time out and ship the page as is. This happened often enough that it showed up as a blip in the performance graph.
The spell server was based on a Bloom filter. A Bloom filter is a variation on a hash table where you only record that a bucket exists, but not its contents. A Bloom filter will quickly and reliably tell you if an entry is not in the table, but only probabilistically tell you if an entry is in the table. Bloom filters rely on having a good hash function and having a mostly empty hash table. If the hash table is mostly empty, the Bloom filter will usually end up hitting an empty bucket and returning false immediately.
A quick look at the spellserver showed that the Bloom filter was almost always getting a hit, this would send the “no go” signal and trigger a slow lookup to find the misspelled word. The problem was that the Bloom filter was too small. It had too few buckets, so most buckets had an entry. Words would always get a hit in the Bloom filter and so the search appliance thought that all words were misspelled.
I adusted the size of the Bloom filter, giving it several million buckets. Now correctly spelled words would likely hit an empty bucket and the filter would give a “go” signal to the response page generator. Problem solved, and the spike at 300 ms went away.
Why was the Bloom filter undersized and not being automatically resized to reduce the false positives to a reasonable rate?
ReplyDeleteThat would have been nice, but that's not what the author did. As it was written, the size of the Bloom filter was a constant fixed in a config file that was read upon booting the machine.
ReplyDelete