It is one of the most ubiquitous problems in software engineering: finding a place to store some data.
We may have multiple shards of a database, multiple disks in a computer, or just multiple files in a folder. We want our data to be spread out evenly across these buckets, but that is easier said than done. Searching for the least utilized bucket comes with a performance penalty, doing something like a round-robin allocation requires state, and choosing at random will eventually lead to a pretty broad (standard) distribution of fill levels across the buckets.
steps = 1000
buckets = 10
data = np.zeros((steps, buckets))
for i in range(steps):
data[i] = data[i-1]
idx = random.randint(0, buckets - 1)
data[i, idx] += 1
Surprisingly, if we just select two random buckets instead of one and select the less utilized one for storage, the problem all but vanishes, yielding an exponentially better distribution.
for i in range(steps):
data[i] = data[i-1]
idx_a = random.randint(0, buckets - 1)
idx_b = random.randint(0, buckets - 1)
idx = idx_a if data[i, idx_a] < data[i, idx_b] else idx_b
data[i, idx] += 1
This phenomenon is well-studied in load-balancing and hashing theory and is known as the Power of Two Choices, yet it remains my favorite this-really-should-not-work-that-well algorithm.