Saturday, September 5, 2009

But *How* Random Is It?

Folks use /dev/urandom and $RANDOM to generate uniformly-distributed random numbers. But are they really?

Let's test. Here's how.

Surprise! The easiest way to start is to ask a more general question: Suppose you have a couple of sets of numbers. Are they from the same distribution?

For example, if you have two batches of 10,000 eight-digit numbers, one from each distribution, are they really spread out from 0 to 99,999,999 the same way, or do they have clumps in different places?

There are ways to attack this, even if you don't know (or care) what the original distribution was. Here's a step-by-step walkthrough:
  • Sort them together, smallest to biggest, coloring the first set red and the second, green.
  • Put your finger on the middle of a line--zero on the number line--and begin reading off colors. If the number's red, move your finger left. If it's green, move it right.
  • See how far away from the origin you wander.
Half the numbers are red, half green, so you'll end up back where you started.
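The walk is easier to see on a toy pair of sets before wiring up real generators. Here the reds are 1, 4, 5 (each one steps left, -1) and the greens are 2, 3, 6 (each steps right, +1) -- made-up numbers, just to watch the finger move:

```shell
# Tag each toy set with its step, sort them together on the first
# column, and print the running position after every number.
sort -n <(printf '%s\n' 1 4 5 | awk '$2=-1') \
        <(printf '%s\n' 2 3 6 | awk '$2=1') |
awk '{pos += $2; print $1, pos}'
```

The last line reads `6 0`: half the numbers are red and half green, so the walk ends back at the origin, just as promised.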

How far away might you get along the way? Depends on the distribution of the reds and greens. If all the reds are smaller than all the greens, you'll go left to ten thousand, then turn around and come right back. If they're all bigger, then you'll get ten thousand away to the right before you snap back like a yo-yo. And except for these cases, you won't go as far before you return.

If the numbers are from the same distribution, then whether they're red or green is a coin-toss. The distance I expect to get from the origin goes up as the square root of the size of my collections.

(That's on average -- any single pair of batches could end up almost anywhere. "... the 'one chance in a million' will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be ...." -- R. A. Fisher)

There are lots of ways to get this result:
  1. from the standard deviation of a binomial with p=0.5
  2. from a two-sample Kolmogorov-Smirnov statistic with equal sample sizes
  3. from looking at it as Brownian motion, using the root-mean-square (RMS) argument attributed to Einstein, in Feynman, volume 1, chapter 6
All roads lead to Rome.
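Whichever derivation you like, a quick simulation backs up the square-root claim. Flip a fair coin n times, step +1 or -1 accordingly, and track the farthest the walk strays (the function name and step counts here are mine, just for illustration):

```shell
# walk N: simulate an N-step coin-flip walk and print its maximum
# distance from the origin.
walk() {
  local i
  for ((i = 0; i < $1; i++)); do
    echo $(( RANDOM % 2 ? 1 : -1 ))
  done |
  awk '{pos += $1; d = (pos < 0) ? -pos : pos; if (d > max) max = d} END {print max}'
}
walk 100     # typically on the order of 10
walk 10000   # typically on the order of 100
```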
Note: nothing I've said, so far, depends on the numbers being random--this just tests whether the red and green numbers come from the same distribution. In what follows, though, I'm going to ask whether batches spit out by random-number generators are uniformly-distributed.
Let's write code. And, since all code roads lead to Rome, too, I'll do it in the shell.

Start with two random-number generators, R1 and R2, each of which generates some fixed-size batch of random numbers.
  • Check that each one spits out the same sized batch.
$ R1 | wc -l ; R2 | wc -l
  • Mark each output with a second field, saying which way each number will move my finger
$ R1 | awk '$2=1'; R2 | awk '$2=-1'
  • Sort the two together, on the first column
$ sort -n <(R1 | awk '$2=1') <(R2 | awk '$2=-1')
  • Run through the output, keeping track, at each step, of how far I've gotten away from the origin
$ sort -n <(R1 | awk '$2=1') <(R2 | awk '$2=-1') | awk '{pos += $2; d = (pos < 0) ? -pos : pos; print d}'
Notice how I'm just recalling the last command and editing it. All this is on the command line, where I can watch the effect of each step.
  • Print just the biggest distance I get from the origin, for the whole trip

$ sort -n <(R1 | awk '$2=1') <(R2 | awk '$2=-1') | awk '{pos += $2; d = (pos < 0) ? -pos : pos; print d}' | sort -n | tail -1
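Once it works, the whole pipeline can be bottled up as a function that takes the two generator commands as arguments (a convenience sketch; the name maxdist is mine, not the post's):

```shell
# maxdist CMD1 CMD2: print the farthest the walk gets from the
# origin for one paired run of the two generators.
maxdist() {
  sort -n <($1 | awk '$2=1') <($2 | awk '$2=-1') |
  awk '{pos += $2; d = (pos < 0) ? -pos : pos; print d}' |
  sort -n | tail -1
}
```

Then `maxdist R1 R2` -- or `maxdist 'R1 100' 'R1 100'`, since the unquoted $1 word-splits -- replaces the long command line.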
Easy enough. I'll try it first with a script that uses the Halgorithm:
#!/bin/bash

head -10000 /dev/urandom | tr -dc 0-9 | perl -pe 's/(.{8})/\1\n/g' | head -${1:-10000}
Whatever argument I give it on the command line tells it how many eight-digit integers to generate. [Default: 10,000]

Two runs should produce different batches, but the same distribution. I bring back the earlier command and do half a dozen runs with 100 numbers from each batch.
$ for i in {1..6}; do sort -n <(R1 100 | awk '$2=1') <(R1 100 | awk '$2=-1') | awk '{pos += $2; d = (pos < 0) ? -pos : pos; print d}' | sort -n | tail -1 ; done
13
20
6
7
10
10
The average? 10.6.

(Careful! I'm comparing the batch produced by one run of the generator with the batch produced by a different run. It tells me what to expect for different batches from the same distribution.)

More runs, a hundred instead of half a dozen, just gives a more accurate average: 12.2
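That hundred-run average is the same loop with the arithmetic tacked on the end. Here's a self-contained version, with a stand-in $RANDOM batch generator playing the part of R1 (the stand-in, and the name G, are mine); for batches of 100 it typically lands near 12:

```shell
# G: a stand-in generator that prints $1 random integers, one per line.
G() { local i; for ((i = 0; i < $1; i++)); do echo $RANDOM; done; }

# A hundred paired runs, averaged.
for i in {1..100}; do
  sort -n <(G 100 | awk '$2=1') <(G 100 | awk '$2=-1') |
  awk '{pos += $2; d = (pos < 0) ? -pos : pos; print d}' |
  sort -n | tail -1
done | awk '{sum += $1} END {printf "%.1f\n", sum / NR}'
```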

If it varies with the square root of the size of the batches, batches of 10,000 should give numbers nearer to 120.
$ for i in {1..6}; do sort -n <(R1 | awk '$2=1') <(R1 | awk '$2=-1') | awk '{pos += $2; d = (pos < 0) ? -pos : pos; print d}' | sort -n | tail -1 ; done

90
97
71
91
105
127

Good. The average value, over 100 paired runs, is about 125.

This, finally, lets me compare Hal's method with the shell-only method from an earlier post, which uses $RANDOM. Does $RANDOM have a different distribution from /dev/urandom?

Nope. The average distance is, again, about 125. If they're non-uniform, they're non-uniform in the same way.
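For the record, the earlier post's $RANDOM generator isn't reproduced here. One plausible reconstruction -- my sketch, not necessarily what that post did -- glues two of $RANDOM's 15-bit draws together to cover eight decimal digits:

```shell
# R2: print $1 (default 10,000) eight-digit integers built from $RANDOM.
# Two 15-bit draws give 30 bits; the modulo leaves a small bias toward
# low values, but close enough to uniform for this test.
R2() {
  local i n=${1:-10000}
  for ((i = 0; i < n; i++)); do
    printf '%08d\n' $(( (RANDOM * 32768 + RANDOM) % 100000000 ))
  done
}
R2 3   # a quick peek: three zero-padded, eight-digit numbers
```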

But maybe neither is random. What if I compare them with a batch of uniformly-distributed, true random integers, downloaded from the web?

About 135. A little higher, but not high enough to set off a panic alarm. $RANDOM and /dev/urandom both give pseudo-random numbers that are, roughly, uniformly distributed.

1 comment:

Nick said...

There's a fairly important abstraction you left out: what if you have a random number generator that always prints 12 or -12? Or even one that always gives something divisible by 12?