The unreasonable efficiency of a random comparison

It is one of the most ubiquitous problems in software engineering: finding a place to store some data.

We may have multiple shards of a database, multiple disks in a computer, or just multiple files in a folder. We want our data to be spread out evenly across these buckets, but that is easier said than done. Searching for the least utilized bucket comes with a performance penalty, round-robin allocation requires keeping state, and choosing a bucket at random lets the fill levels drift apart over time: their standard deviation keeps growing as more items are stored.

import random

import numpy as np

steps = 1000
buckets = 10
data = np.zeros((steps, buckets))

for i in range(steps):
    # Carry over the previous fill levels; on the first step, row -1 is still all zeros.
    data[i] = data[i - 1]
    # Pick one bucket uniformly at random and store the item there.
    idx = random.randint(0, buckets - 1)
    data[i, idx] += 1

Random allocation

Surprisingly, if we pick two random buckets instead of one and store the item in whichever is less utilized, the problem all but vanishes. In the classic balls-into-bins analysis, spreading n items over n buckets leaves the fullest bucket holding roughly log n / log log n items with a single random choice, but only roughly log log n with two, an exponentially better distribution.

# Start from empty buckets again so the two strategies can be compared fairly.
data = np.zeros((steps, buckets))

for i in range(steps):
    data[i] = data[i - 1]
    # Sample two candidate buckets and store the item in whichever is currently less full.
    idx_a = random.randint(0, buckets - 1)
    idx_b = random.randint(0, buckets - 1)
    idx = idx_a if data[i, idx_a] < data[i, idx_b] else idx_b
    data[i, idx] += 1

With comparison
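
To put a number on that claim, here is a minimal sketch (the simulate helper is hypothetical, not part of the snippets above) that re-runs both strategies and prints how far the fullest bucket ends up above the average fill level.

import random

import numpy as np

def simulate(steps, buckets, two_choices):
    # Re-implementation of the snippets above as a function: allocate
    # `steps` items into `buckets` buckets and return the final fill levels.
    fill = np.zeros(buckets)
    for _ in range(steps):
        idx = random.randint(0, buckets - 1)
        if two_choices:
            # Power of two choices: compare against a second random bucket.
            alt = random.randint(0, buckets - 1)
            if fill[alt] < fill[idx]:
                idx = alt
        fill[idx] += 1
    return fill

for two_choices in (False, True):
    fill = simulate(steps=1000, buckets=10, two_choices=two_choices)
    print(f"two_choices={two_choices}: fullest bucket is "
          f"{fill.max() - fill.mean():.1f} items above the average")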

This phenomenon is well-studied in load-balancing and hashing theory and is known as the Power of Two Choices, yet it remains my favorite this-really-should-not-work-that-well algorithm.
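
In practice the same trick drops into almost any placement decision with very little code. A minimal sketch, assuming a hypothetical list of shard objects that expose a used_bytes attribute:

import random

def pick_shard(shards):
    # Power of two choices: sample two candidate shards at random and
    # return whichever currently stores less data. `shards` and the
    # `used_bytes` attribute are hypothetical stand-ins for whatever
    # your system actually exposes.
    a, b = random.choice(shards), random.choice(shards)
    return a if a.used_bytes <= b.used_bytes else b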