Friday, August 28, 2009

Hash Collisions in Git

This post would be easier if I knew how to post formulas in Blogger. If you know, tell me.

In the Pro-Git Book, Scott Chacon notes that to have a probability of a SHA1-hash collision rise to 1/2, you need about 10^24 objects (more, he says, than the number of sand grains on earth).

Where's this come from? It's a classic birthday problem. If we have K people, with birthdays on any of 365 days, what are the odds of two or more people having the same birthday?

P(same) = 1-P(everyone's different).

To assign different birthdays to K different people, we can give the first person any of 365 birthdays, the next any of the remaining 364, and so on. The Kth person has a choice of 365-K+1 days.

Turning this into a probability means dividing by 365^K -- the number of different ways to assign birthdays to K folks if you don't care about collisions.

But what's that product? Just multiply it out:

X = 365(365-1)(365-2) ... (365-K+1) = 365^K - [1+2+...+(365-K+1)]365^(K-1) + something,

Where something is O(365^(K-2)) (That's big-Oh notation, if it's not clear.)

Dividing by 365^K gives, approximately, P(no collision) = 1-[K(K-1)/2]/365,
or P(collision) = (1/365)K(K-1)/2 .

For SHA1 sums, we replace 365 by the number of 160-bit SHA1 values: 2^160.

How big does K have to be for (1/2^160)K(K-1)/2 to be about 1/2? About
K ~ sqrt(2^160) = 2^80, or about 10^24!

Another quick applications of this idea comes from the fact that the SHA1 prefix git is now requiring to specify unique kernel objects is 10 hex digits long. We can, therefore, guess the database now has something on the order of 2^20, or a million, distinct objects.

And then there's the original birthday party question. If you have sqrt(365) people in the room -- that's 17 or 18 -- you have about a 50-50 chance that two will have the same birthday.

This would, however, been clearer if I could have stuck real formulae in the post.

No comments: