Wednesday, September 30, 2009

Grobycs: Cyborgs in Reverse

We've read about cyborgs, computer-enhanced humans, for decades. Misdirection? We should have been thinking about grobycs: human-enhanced computers. Cyborgs in reverse.

Consider these:
All use people as pieces of a collection of software. They're cyborgs in reverse: Grobycs.

The Six-Million-Dollar Man wanted her:

What do you think grobycs will lust after?

Wednesday, September 23, 2009

Variables Without Values

Often, you only want to do something if a variable is unset.

I see code to handle such conditions that looks like this:
if [ "A$foo" = "A" ]
then
    take-some-action
fi
It's either old code or code by old programmers.

The test is how you used to have to ask whether a variable was empty. Here's a newer idiom.

[ "$foo" ] || take-some-action
The test operator, [ ], comes in many flavors. For example,
[ -d $X ]
asks "Does $X name a directory?"

In its simplest form, though,

[ $X ]
asks "Is $X empty?" The test returns true if $X has something in it, false if it doesn't. The new code does the same thing as the old, but it's shorter and, arguably, easier to read.

But why not this?

[ $foo ] || take-some-action
Ah. Because if foo='1 2 3', then test sees several arguments where it expects one, and complains:

[: 2: unary operator expected
One more trick: what if you're running with "set -u", which balks whenever it stumbles on an unset variable, with complaints like this:
line 3: foo: unbound variable

Write the test like this:
["${foo-}" ] || take some action
If $foo is unset, ${foo-} has a null value. "Null" is a value; "unset" really means the variable was never set. So, with "${foo-}" the shell won't complain about an unset variable, but the test will still fail.
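
You can watch the difference at an interactive prompt. A quick demo (the exact wording of the complaint varies with the bash version):

$ set -u
$ [ "$foo" ] || echo "foo needs a value"
bash: foo: unbound variable
$ [ "${foo-}" ] || echo "foo needs a value"
foo needs a value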

This is still shorter than the example we started with, but it's only more readable once you can read the idiom ${foo-} without stumbling.

On the other hand, the first example will also choke if $foo is unset, so it would also have to be modified like this:
if [ "A${foo-}" = "A" ]
then
take-some-action
fi

More Include Guards

I've already written about how to make include guards for "shell libraries" (files full of shell functions or variables to be sourced).

This is the basic form of the statement, which requires tailoring for each library:
if [ ${_gripe_version:-0} -gt 0 ]
then
    return 0
else
    _gripe_version=1
fi
Here's a more generic statement that you can put at the top of a file to achieve the same goal:
[ "${_libs/${BASH_SOURCE[0]}}" = "${_libs=}" ] ||
return 0 && _libs+=" ${BASH_SOURCE[0]}"
A dramatic reading of this off-putting code is left as an exercise to the reader.

It has the added attraction that you can, at any point, find out which libraries you've already sourced with echo $_libs.

It has the added detraction that you'll have to cut-and-paste it. There's no way you'll ever keep it in your head and type it correctly.
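
If the dramatic reading stumps you, here's one crib, piece by piece (my annotations, not part of the original):

# "${_libs=}"                   _libs itself; the = first sets it to ""
#                               if it was unset, so set -u stays quiet
# "${_libs/${BASH_SOURCE[0]}}"  _libs with this file's name deleted
# The two expansions match only when this file isn't in _libs yet:
#   match    -> skip the ||, take the &&: record this file in _libs
#   mismatch -> take the ||: we've been sourced before, so return 0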

As a good test, first try this infinite recursion without include guards. (You'll have to ^C out quickly, or you'll get stuck in a source-a-thon.)
$ echo 'source foo.sh' > foo.sh
$ chmod +x foo.sh
$ ./foo.sh # don't wait long to ^C
Next, edit foo.sh to add an include guard at the top and re-run it. This time, it will return immediately, without help.
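
With the generic guard pasted in, foo.sh would look something like this:

$ cat foo.sh
[ "${_libs/${BASH_SOURCE[0]}}" = "${_libs=}" ] ||
return 0 && _libs+=" ${BASH_SOURCE[0]}"
source foo.sh

The sourced copy finds its name already recorded in _libs and returns before it can recurse.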

Friday, September 18, 2009

Simplifying Loopy Code


The determined Real Programmer can write FORTRAN programs in any language.
I just read through a friend's shell script. He doesn't program in the shell much, but he's a superb C programmer, so his script has loops that look like this:
file[0]=foo
file[1]=bar
file[2]=mumble
suffix[0]=.c
suffix[1]=.h
suffix[2]=.txt

nfiles=${#file[@]}
nsuffixes=${#suffix[@]}

i=0
while [ $i -lt $nfiles ]
do
    j=0
    while [ $j -lt $nsuffixes ]
    do
        process ${file[$i]}${suffix[$j]}
        (( j = j + 1 ))
    done
    (( i = i + 1 ))
done
Yep. That'll work. But this will, too.
for filename in {foo,bar,mumble}.{c,h,txt}
do
    process $filename
done
Bash sees filenames as strings, and most shell commands accept space-separated lists of filenames. The shell handles anything that expands to a list of filenames, like globs or brace expansions, with real ease.
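
You don't even have to run anything to see what the loop will get; echo shows the brace expansion:

$ echo {foo,bar,mumble}.{c,h,txt}
foo.c foo.h foo.txt bar.c bar.h bar.txt mumble.c mumble.h mumble.txt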

Use the Shell, Luke.

Wednesday, September 16, 2009

Call Me Now

Click on the Google Call Me button, below, fill in your own name and phone number, and then click Connect.

You'll be connected to me immediately.



Saturday, September 5, 2009

But *How* Random Is It?

Folks use /dev/urandom and $RANDOM to generate uniformly-distributed random numbers. Are they random? Yes.

Let's test. Here's how.

Surprise! The easiest way to start is to ask a more general question: Suppose you have a couple of sets of numbers. Are they from the same distribution?

For example, if you have a couple of batches of 10,000 8-digit numbers, from two different sources, are they really spread out from 0 to 99999999 the same way, or do they have clumps in different places?

There are ways to attack this, even if you don't know (or care) what the original distribution was. Here's a step-by-step walkthrough:
  • Sort them together, smallest to biggest, coloring the first set red and the second, green.
  • Put your finger on the middle of a line--zero on the number line--and begin reading off colors. If the number's red, move your finger left. If it's green, move it right.
  • See how far away from the origin you wander.
Half the numbers are red, half green, so you'll end up back where you started.

How far away might you get along the way? Depends on the distribution of the reds and greens. If all the reds are smaller than all the greens, you'll go left to ten thousand, then turn around and come right back. If they're all bigger, then you'll get ten thousand away to the right before you snap back like a yo-yo. And except for these cases, you won't go as far before you return.

If the numbers are from the same distribution, then whether they're red or green is a coin-toss. The distance I expect to get from the origin goes up as the square-root of the size of my collections.

(That's on average -- any single pair of batches could end up almost anywhere. "... the 'one chance in a million' will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be ...." -- R. A. Fisher)

There are lots of ways to get this result:
  1. from the standard deviation of a binomial with p=0.5
  2. from a two-sample Kolmogorov-Smirnov statistic with equal sample sizes
  3. from looking at it as Brownian motion, using the root-mean-square (RMS) argument attributed to Einstein, in Feynman, volume 1, chapter 6
All roads lead to Rome.
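To hang a number on that square-root claim (my back-of-the-envelope, via route 3): two batches of 100 make a walk of 200 coin-toss steps, and the RMS argument says a walk of n steps typically wanders about sqrt(n) away:

$ echo 'scale=2; sqrt(200)' | bc -l
14.14

That's the right ballpark for the batches-of-100 runs below, which hover around a dozen.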
Note: nothing I've said, so far, depends on the numbers being random--this just tests whether the red and green numbers come from the same distribution. In what follows, though, I'm going to ask whether batches spit out by random-number generators are uniformly-distributed.
Let's write code. And, since all code roads lead to Rome, too, I'll do it in the shell.

Start with two random-number generators, R1 and R2, that each generate some fixed-size batch of random numbers.
  • Check that each one spits out the same sized batch.
$ R1 | wc -l ; R2 | wc -l
  • Mark each output with a second field, saying which way each number will move my finger
$ R1 | awk '$2=1'; R2 | awk '$2=-1'
  • Sort the two together, on the first column
$ sort -n <(R1 | awk '$2=1') <(R2 | awk '$2=-1')
  • Run through the output, keeping track, at each step, of how far I've gotten away from the origin
$ sort -n <(R1 | awk '$2=1') <(R2 | awk '$2=-1') | awk '{pos += $2; d = (pos < 0) ? -pos : pos; print d}'
Notice how I'm just recalling the last command and editing it. All this is on the command line, where I can watch the effect of each step.
  • Print just the biggest distance I get from the origin, for the whole trip

$ sort -n <(R1 | awk '$2=1') <(R2 | awk '$2=-1') | awk '{pos += $2; d = (pos < 0) ? -pos : pos; print d}' | sort -n | tail -1
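
If you're going to run that pipeline a lot, it's worth wrapping in a function. A sketch (the function and its name, maxdrift, are mine, and it assumes each generator is a command that needs no arguments):

maxdrift() {
    sort -n <("$1" | awk '$2=1') <("$2" | awk '$2=-1') |
        awk '{pos += $2; d = (pos < 0) ? -pos : pos; print d}' |
        sort -n | tail -1
}

Then the comparison reads $ maxdrift R1 R2.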
Easy enough. I'll try it first with a script that uses the Halgorithm:
#!/bin/bash

head -10000 /dev/urandom | tr -dc 0-9 | perl -pe 's/(.{8})/\1\n/g' | head -${1:-10000}
Whatever argument I give it on the command line tells it how many eight-digit integers to generate. [Default: 10,000]

Two runs should produce different batches, but the same distribution. I bring back the earlier command and do half a dozen runs with 100 numbers from each batch.
$ for i in {1..6}; do sort -n <(R1 100 | awk '$2=1') <(R1 100 | awk '$2=-1') | awk '{pos += $2; d = (pos < 0) ? -pos : pos; print d}' | sort -n | tail -1 ; done
13
20
6
7
10
10
The average? 10.6.

(Careful! I'm comparing the batch produced by one run of the generator with the batch produced by a different run. It tells me what to expect for different batches from the same distribution.)

More runs, a hundred instead of half a dozen, just give a more accurate average: 12.2.

If it varies with the square root of the size of the batches, then batches of 10,000, a hundred times bigger, should give numbers sqrt(100) = 10 times larger: nearer to 120.
$ for i in {1..6}; do sort -n <(R1 | awk '$2=1') <(R1 | awk '$2=-1') | awk '{pos += $2; d = (pos < 0) ? -pos : pos; print d}' | sort -n | tail -1 ; done

90
97
71
91
105
127

Good. The average value, over 100 paired runs, is about 125.

This, finally, lets me compare Hal's method with the shell-only method from an earlier post, which uses $RANDOM. Does $RANDOM have a different distribution from /dev/urandom?

Nope. The average distance is, again, about 125. If they're non-uniform, they're non-uniform in the same way.

But maybe neither is random. What if I compare them with a batch of uniformly-distributed, true random integers, downloaded from the web?

About 135. A little higher, but not high enough to set off a panic alarm. $RANDOM and /dev/urandom both give pseudo-random-numbers that are, roughly, uniformly distributed.

Thursday, September 3, 2009

Another Random Post

Hal Pomeranz suggests this one-liner
head /dev/urandom | tr -dc 0-9 | sed -r 's/(.{8})/\1\n/g'
for generating big batches of random numbers. It's a winner: reasonable performance and easy to type. Timing information follows:
#!/bin/bash

echo == Shell arithmetic
(( R = 2**15-1, T = 10**8-1, C = T/R ))
readonly R T C

time for i in {0..10000}
do
    printf "%8.8u\n" $(( RANDOM*C + ( RANDOM%C ) ))
done > /dev/null

echo; echo == Pipeline
time head -10000 /dev/urandom | tr -dc 0-9 | sed -r 's/(.{8})/\1\n/g' | head -10000 > /dev/null
Here are the numbers:

== Shell arithmetic

real 0m0.262s
user 0m0.256s
sys 0m0.004s

== Pipeline

real 0m0.421s
user 0m0.028s
sys 0m0.392s
Okay, the pipe's a little slower, but it's the same order of magnitude, and way easier to type.

Wednesday, September 2, 2009

The International Slide Rule Museum

From Mike Rosenlof:
They have (who am I kidding! HE has..) an exhibit at the Louisville Public Library.

http://sliderulemuseum.com/

Tuesday, September 1, 2009

Another Challenge From Hal & Ed: Random Numbers In the Shell

Hal Pomeranz and Ed Skoudis have a weekly blog, Command Line Kung Fu, where they set and solve interesting problems in both the Unix/Linux shell and Windows's CLI, COMMAND.COM.

This week's was how to generate random 8-digit integers.

Obligatory pedantic sidebar, which you can skip:

"Random" can mean a lot of things. If I flip a coin, and announce either "00000000" or "00000001", that's an 8-digit, random number. Ho hum. What I'm looking for here is integers uniformly distributed in the interval 0000000..99999999. I'll write this as U(0,10^8-1).
The problem is trickier than you might think. For example, the shell's random-number generator, $RANDOM, is U(0,2^15-1). Not big enough.

How about, say, summing calls to $RANDOM until the number is big enough?

Nope. Adding a bunch of uniformly-distributed numbers gives a random number, all right. But it isn't uniformly distributed anymore. It's a clumped-up bell curve.
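
You can watch the clumping. Chop the range of three summed $RANDOMs into three equal-width buckets (the bucketing is mine, just for illustration) and count; a uniform distribution would fill the buckets evenly, but here the middle one collects roughly two-thirds of the draws:

$ for i in {1..9999}; do echo $(( RANDOM + RANDOM + RANDOM )); done |
    awk '{ count[int($1/32768)]++ } END { for (b in count) print b, count[b] }'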

Hal explores several routes, culminating with a good one-liner:

$ head /dev/urandom | tr -dc 0-9 | cut -c1-8
But is there a way with the shell's own built-ins?

Yep. The answer's below. What's more, because this way doesn't call outside utilities or fork subshells, it should, in theory, be faster. And it is.

How much faster?

On my machine, the one-liner above generates 10,000 random numbers in just under 40 seconds. The code below does the same 10,000 in about a quarter of a second.

I've written it out as a script, with a dramatic reading in the comments, but if I were using it in something else, I'd ditch the big comments and put the code in-line.
#!/bin/bash
## Generate a uniformly distributed,
## 8-digit, random number: U(0,99999999)

## First, the logic

# (1) Get a uniformly-distributed random number,
# N = $RANDOM, up to R = 2^15-1 = 32767.
# That's U(0,R)

# (2) Stretch out the interval covered,
# up to nearly T = 10^8-1
# by multiplying by C = K/R.
# (We'll figure out K in a minute)
# This turns 0,1, 2, 3, ... into 0, C*1, C*2, ... K

# (3) Now add a random shift S = U(0,C-1)
# to turn these into 0, 0+1, 0+2, ... 0+C-1,
# C*1+0, C*1+1, ..., K+0, K+1, ... K+C-1.
# We can get S from another random number M = $RANDOM % C
# (% is the 'mod' operator)

# What's K? We want K+C-1 = T = 10^8-1, so we solve:
# K+C-1 = K+(K/R)-1 = 10^8-1; K[ (R+1)/R ] = 10^8;
# K = (10^8)*R/(R+1)

# For big R, that's so close to T
# that I'll just use K=T, C=T/R
# Our U(0,10^8-1) number is N*C + M

## Next, the code

# calculate the constants
(( R = 2**15-1, T = 10**8-1, C = T/R ))
readonly R T C

# do the calculation, and print the result out
# with 8, full digits.
printf "%8.8u\n" $(( RANDOM*C + ( RANDOM%C ) ))
Nothing up my sleeve ... Presto! Not bad, for a shell.
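
A quick sanity check on the constants (my arithmetic, not from the post): integer division makes C = 3051, so the biggest number the script can emit is R*C + (C-1):

$ echo $(( (10**8-1)/32767 ))
3051
$ echo $(( 32767*3051 + 3050 ))
99975167

That falls short of 99999999 by about 0.025%, the price of the K=T approximation.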


Update: Hal helpfully points out several things.
  • I misspelled his name.
Oof. Sorry. Fixed. Thanks.
  • There may be even faster solutions.
if the task were to generate 10K 8-digit
random numbers, I'd just suck 80K digits out of /dev/urandom and
chop them up into 8-digit chunks. That would be considerably
faster than running my command line 10K times in a row.
Maybe so .... Is there a way to do that in the shell? It's thought-provoking and an interesting challenge.

You couldn't store the 80K digits in the code, but you could avoid storing them in a file, with the attendant I/O slowdown, by just providing them as output from a pipe.

(I'm told that in Windows, pipes are implemented with intermediate files. Not so in Unix and its offspring--the system really does hand data directly from one process to another.)

So, is there an easy way, in a shell script, to pull 8 digits at a time out of standard in?
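
Here's one possibility, sketched with bash's read -n, which grabs a fixed number of characters at a time. (A sketch only; I haven't raced it against Hal's one-liner.)

tr -dc 0-9 < /dev/urandom | head -c 80000 |
while read -r -n 8 digits
do
    echo "$digits"
done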

  • Execution efficiency isn't why you use the shell.
If the task is to generate a single 8-digit number, I claim my solution
is better from a typing perspective. :-)
Just so. Hal's exactly right.