Wednesday, September 30, 2009

Grobycs: Cyborgs in Reverse

We've read about cyborgs, computer-enhanced humans, for decades. Misdirection? We should have been thinking about grobycs: human-enhanced computers. Cyborgs in reverse.

Consider these:
All use people as pieces of a collection of software. They're cyborgs in reverse: Grobycs.

The Six-Million-Dollar Man wanted her:

What do you think grobycs will lust after?

Wednesday, September 23, 2009

Variables Without Values

Often, you only want to do something if a variable is unset.

I see code to handle such conditions that looks like this:
if [ "A$foo" = "A" ]
then
    take-some-action
fi
It's either old code or code by old programmers.

The test is how you used to have to ask whether a variable was empty. Here's a newer idiom.

[ "$foo" ] || take-some-action
The test operator, [ ], comes in many flavors. For example,
[ -d $X ]
asks "Does $X name a directory?"

In its simplest form, though,

[ $X ]
asks "Is $X empty?" The test returns true if $X has something in it, false if it doesn't. The new code does the same thing as the old, but it's shorter and, arguably, easier to read.

But why not this?

[ $foo ] || take-some-action
Ah. Because if foo='1 2 3', then test sees several arguments where it expects one, and complains:

[: 2: unary operator expected
One more trick: what if you're running with "set -u", which balks whenever it stumbles on an unset variable, with complaints like this:
line 3: foo: unbound variable

Write the test like this:
["${foo-}" ] || take some action
If $foo is unset, ${foo-} has a null value. "Null" is a value; "unset" really means the variable was never set. So, with "${foo-}" the shell won't complain about an unset variable, but the test will still fail.
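
You can watch the difference at an interactive prompt. A quick demo (the exact wording of the complaint varies with the bash version):

$ set -u
$ [ "$foo" ] || echo "foo needs a value"
bash: foo: unbound variable
$ [ "${foo-}" ] || echo "foo needs a value"
foo needs a value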

This is still shorter than the example we started with, but it's only more readable once you can read the idiom ${foo-} without stumbling.

On the other hand, the first example will also choke if $foo is unset, so it would also have to be modified like this:
if [ "A${foo-}" = "A" ]
then
take-some-action
fi

More Include Guards

I've already written about how to make include guards for "shell libraries" (files full of shell functions or variables to be sourced).

This is the basic form of the statement, which requires tailoring for each library:
if [ ${_gripe_version:-0} -gt 0 ]
then
    return 0
else
    _gripe_version=1
fi
Here's a more generic statement that you can put at the top of a file to achieve the same goal:
[ "${_libs/${BASH_SOURCE[0]}}" = "${_libs=}" ] ||
return 0 && _libs+=" ${BASH_SOURCE[0]}"
A dramatic reading of this off-putting code is left as an exercise to the reader.

It has the added attraction that you can, at any point, find out which libraries you've already sourced with echo $_libs.

It has the added detraction that you'll have to cut-and-paste it. There's no way you'll ever keep it in your head and type it correctly.
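
If the dramatic reading stumps you, here's one crib, piece by piece (my annotations, not part of the original):

# "${_libs=}"                   _libs itself; the = first sets it to ""
#                               if it was unset, so set -u stays quiet
# "${_libs/${BASH_SOURCE[0]}}"  _libs with this file's name deleted
# The two expansions match only when this file isn't in _libs yet:
#   match    -> skip the ||, take the &&: record this file in _libs
#   mismatch -> take the ||: we've been sourced before, so return 0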

As a good test, first try this infinite recursion without include guards. (You'll have to ^C out quickly, or you'll get stuck in a source-a-thon.)
$ echo 'source foo.sh' > foo.sh
$ chmod +x foo.sh
$ ./foo.sh # don't wait long to ^C
Next, edit foo.sh to add an include guard at the top and re-run it. This time, it will return immediately, without help.
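
With the generic guard pasted in, foo.sh would look something like this:

$ cat foo.sh
[ "${_libs/${BASH_SOURCE[0]}}" = "${_libs=}" ] ||
return 0 && _libs+=" ${BASH_SOURCE[0]}"
source foo.sh

The sourced copy finds its name already recorded in _libs and returns before it can recurse.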

Friday, September 18, 2009

Simplifying Loopy Code


The determined Real Programmer can write FORTRAN programs in any language.
I just read through a friend's shell script. He doesn't program in the shell much, but he's a superb C programmer, so his script has loops that look like this:
file[0]=foo
file[1]=bar
file[2]=mumble
suffix[0]=.c
suffix[1]=.h
suffix[2]=.txt

nfiles=${#file[@]}
nsuffixes=${#suffix[@]}

i=0
while [ $i -lt $nfiles ]
do
    j=0
    while [ $j -lt $nsuffixes ]
    do
        process ${file[$i]}${suffix[$j]}
        (( j = j + 1 ))
    done
    (( i = i + 1 ))
done
Yep. That'll work. But this will, too.
for filename in {foo,bar,mumble}.{c,h,txt}
do
    process $filename
done
Bash sees filenames as strings, and most shell commands accept space-separated lists of filenames. The shell handles anything that expands to a list of filenames, like globs or brace expansions, with real ease.
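
You don't even have to run anything to see what the loop will get; echo shows the brace expansion:

$ echo {foo,bar,mumble}.{c,h,txt}
foo.c foo.h foo.txt bar.c bar.h bar.txt mumble.c mumble.h mumble.txt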

Use the Shell, Luke.

Wednesday, September 16, 2009

Call Me Now

Click on the Google Call Me button, below, fill in your own name and phone number, and then click Connect.

You'll be connected to me immediately.



Saturday, September 5, 2009

But *How* Random Is It?

Folks use /dev/urandom and $RANDOM to generate uniformly-distributed random numbers. Are they random? Yes.

Let's test. Here's how.

Surprise! The easiest way to start is to ask a more general question: Suppose you have a couple of sets of numbers. Are they from the same distribution?

For example, if you have a couple of batches of 10,000 8-digit numbers, from two different sources, are they really spread out from 0 to 99999999 the same way, or do they have clumps in different places?

There are ways to attack this, even if you don't know (or care) what the original distribution was. Here's a step-by-step walkthrough:
  • Sort them together, smallest to biggest, coloring the first set red and the second, green.
  • Put your finger on the middle of a line--zero on the number line--and begin reading off colors. If the number's red, move your finger left. If it's green, move it right.
  • See how far away from the origin you wander.
Half the numbers are red, half green, so you'll end up back where you started.

How far away might you get along the way? Depends on the distribution of the reds and greens. If all the reds are smaller than all the greens, you'll go left to ten thousand, then turn around and come right back. If they're all bigger, then you'll get ten thousand away to the right before you snap back like a yo-yo. And except for these cases, you won't go as far before you return.

If the numbers are from the same distribution, then whether they're red or green is a coin-toss. The distance I expect to get from the origin goes up as the square-root of the size of my collections.

(That's on average -- any single pair of batches could end up almost anywhere. "... the 'one chance in a million' will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be ...." -- R. A. Fisher)

There are lots of ways to get this result:
  1. from the standard deviation of a binomial with p=0.5
  2. from a two-sample Kolmogorov-Smirnov statistic with equal sample sizes
  3. from looking at it as Brownian motion, using the root-mean-square (RMS) argument attributed to Einstein, in Feynman, volume 1, chapter 6
All roads lead to Rome.
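To hang a number on that square-root claim (my back-of-the-envelope, via route 3): two batches of 100 make a walk of 200 coin-toss steps, and the RMS argument says a walk of n steps typically wanders about sqrt(n) away:

$ echo 'scale=2; sqrt(200)' | bc -l
14.14

That's the right ballpark for the batches-of-100 runs below, which hover around a dozen.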
Note: nothing I've said, so far, depends on the numbers being random--this just tests whether the red and green numbers come from the same distribution. In what follows, though, I'm going to ask whether batches spit out by random-number generators are uniformly-distributed.
Let's write code. And, since all code roads lead to Rome, too, I'll do it in the shell.

Start with two random-number generators, R1 and R2, that each generate some fixed-size batch of random numbers.
  • Check that each one spits out the same sized batch.
$ R1 | wc -l ; R2 | wc -l
  • Mark each output with a second field, saying which way each number will move my finger
$ R1 | awk '$2=1'; R2 | awk '$2=-1'
  • Sort the two together, on the first column
$ sort -n <(R1 | awk '$2=1') <(R2 | awk '$2=-1')
  • Run through the output, keeping track, at each step, of how far I've gotten away from the origin
$ sort -n <(R1 | awk '$2=1') <(R2 | awk '$2=-1') | awk '{pos += $2; d = (pos < 0) ? -pos : pos; print d}'
Notice how I'm just recalling the last command and editing it. All this is on the command line, where I can watch the effect of each step.
  • Print just the biggest distance I get from the origin, for the whole trip

$ sort -n <(R1 | awk '$2=1') <(R2 | awk '$2=-1') | awk '{pos += $2; d = (pos < 0) ? -pos : pos; print d}' | sort -n | tail -1
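
If you're going to run that pipeline a lot, it's worth wrapping in a function. A sketch (the function and its name, maxdrift, are mine, and it assumes each generator is a command that needs no arguments):

maxdrift() {
    sort -n <("$1" | awk '$2=1') <("$2" | awk '$2=-1') |
        awk '{pos += $2; d = (pos < 0) ? -pos : pos; print d}' |
        sort -n | tail -1
}

Then the comparison reads $ maxdrift R1 R2.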
Easy enough. I'll try it first with a script that uses the Halgorithm:
#!/bin/bash

head -10000 /dev/urandom | tr -dc 0-9 | perl -pe 's/(.{8})/\1\n/g' | head -${1:-10000}
Whatever argument I give it on the command line tells it how many eight-digit integers to generate. [Default: 10,000]

Two runs should produce different batches, but the same distribution. I bring back the earlier command and do half a dozen runs with 100 numbers from each batch.
$ for i in {1..6}; do sort -n <(R1 100 | awk '$2=1') <(R1 100 | awk '$2=-1') | awk '{pos += $2; d = (pos < 0) ? -pos : pos; print d}' | sort -n | tail -1 ; done
13
20
6
7
10
10
The average? 10.6.

(Careful! I'm comparing the batch produced by one run of the generator with the batch produced by a different run. It tells me what to expect for different batches from the same distribution.)

More runs, a hundred instead of half a dozen, just give a more accurate average: 12.2.

If it varies with the square root of the size of the batches, then batches of 10,000, a hundred times bigger, should give numbers sqrt(100) = 10 times larger: nearer to 120.
$ for i in {1..6}; do sort -n <(R1 | awk '$2=1') <(R1 | awk '$2=-1') | awk '{pos += $2; d = (pos < 0) ? -pos : pos; print d}' | sort -n | tail -1 ; done

90
97
71
91
105
127

Good. The average value, over 100 paired runs, is about 125.

This, finally, lets me compare Hal's method with the shell-only method from an earlier post, which uses $RANDOM. Does $RANDOM have a different distribution from /dev/urandom?

Nope. The average distance is, again, about 125. If they're non-uniform, they're non-uniform in the same way.

But maybe neither is random. What if I compare them with a batch of uniformly-distributed, true random integers, downloaded from the web?

About 135. A little higher, but not high enough to set off a panic alarm. $RANDOM and /dev/urandom both give pseudo-random-numbers that are, roughly, uniformly distributed.

Thursday, September 3, 2009

Another Random Post

Hal Pomeranz suggests this one-liner
head /dev/urandom | tr -dc 0-9 | sed -r 's/(.{8})/\1\n/g'
for generating big batches of random numbers. It's a winner: reasonable performance and easy to type. Timing information follows:
#!/bin/bash

echo == Shell arithmetic
(( R = 2**15-1, T = 10**8-1, C = T/R ))
readonly R T C

time for i in {0..10000}
do
    printf "%8.8u\n" $(( RANDOM*C + ( RANDOM%C ) ))
done > /dev/null

echo; echo == Pipeline
time head -10000 /dev/urandom | tr -dc 0-9 | sed -r 's/(.{8})/\1\n/g' | head -10000 > /dev/null
Here are the numbers:

== Shell arithmetic

real 0m0.262s
user 0m0.256s
sys 0m0.004s

== Pipeline

real 0m0.421s
user 0m0.028s
sys 0m0.392s
Okay, the pipe's a little slower, but it's the same order of magnitude, and way easier to type.

Wednesday, September 2, 2009

The International Slide Rule Museum

From Mike Rosenlof:
They have (who am I kidding! HE has..) an exhibit at the Louisville Public Library.

http://sliderulemuseum.com/

Tuesday, September 1, 2009

Another Challenge From Hal & Ed: Random Numbers In the Shell

Hal Pomeranz and Ed Skoudis have a weekly blog, Command Line Kung Fu, where they set and solve interesting problems in both the Unix/Linux shell and Windows's CLI, COMMAND.COM.

This week's was how to generate random 8-digit integers.

Obligatory pedantic sidebar, which you can skip:

"Random" can mean a lot of things. If I flip a coin, and announce either "00000000" or "00000001", that's an 8-digit, random number. Ho hum. What I'm looking for here is integers uniformly distributed in the interval 0000000..99999999. I'll write this as U(0,10^8-1).
The problem is trickier than you might think. For example, the shell's random-number generator, $RANDOM, is U(0,2^15-1). Not big enough.

How about, say, summing calls to $RANDOM until the number is big enough?

Nope. Adding a bunch of uniformly-distributed numbers gives a random number, all right. But it isn't uniformly distributed anymore. It's a clumped-up bell curve.
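
You can watch the clumping. Chop the range of three summed $RANDOMs into three equal-width buckets (the bucketing is mine, just for illustration) and count; a uniform distribution would fill the buckets evenly, but here the middle one collects roughly two-thirds of the draws:

$ for i in {1..9999}; do echo $(( RANDOM + RANDOM + RANDOM )); done |
    awk '{ count[int($1/32768)]++ } END { for (b in count) print b, count[b] }'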

Hal explores several routes, culminating with a good one-liner:

$ head /dev/urandom | tr -dc 0-9 | cut -c1-8
But is there a way with the shell's own built-ins?

Yep. The answer's below. What's more, because this way doesn't call outside utilities or fork subshells, it should, in theory, be faster. And it is.

How much faster?

On my machine, the one-liner above generates 10,000 random numbers in just under 40 seconds. The code below does the same 10,000 in about a quarter of a second.

I've written it out as a script, with a dramatic reading in the comments, but if I were using it in something else, I'd ditch the big comments and put the code in-line.
#!/bin/bash
## Generate a uniformly distributed,
## 8-digit, random number: U(0,99999999)

## First, the logic

# (1) Get a uniformly-distributed random number,
# N = $RANDOM, up to R = 2^15-1 = 32767.
# That's U(0,R)

# (2) Stretch out the interval covered,
# up to nearly T = 10^8-1
# by multiplying by C = K/R.
# (We'll figure out K in a minute)
# This turns 0,1, 2, 3, ... into 0, C*1, C*2, ... K

# (3) Now add a random shift S = U(0,C-1)
# to turn these into 0, 0+1, 0+2, ... 0+C-1,
# C*1+0, C*1+1, ..., K+0, K+1, ... K+C-1.
# We can get S from another random number M = $RANDOM % C
# (% is the 'mod' operator)

# What's K? We want K+C-1 = T = 10^8-1, so we solve:
# K+C-1 = K+(K/R)-1 = 10^8-1; K[ (R+1)/R ] = 10^8;
# K = (10^8)*R/(R+1)

# For big R, that's so close to T
# that I'll just use K=T, C=T/R
# Our U(0,10^8-1) number is N*C + M

## Next, the code

# calculate the constants
(( R = 2**15-1, T = 10**8-1, C = T/R ))
readonly R T C

# do the calculation, and print the result out
# with 8, full digits.
printf "%8.8u\n" $(( RANDOM*C + ( RANDOM%C ) ))
Nothing up my sleeve ... Presto! Not bad, for a shell.
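
A quick sanity check on the constants (my arithmetic, not from the post): integer division makes C = 3051, so the biggest number the script can emit is R*C + (C-1):

$ echo $(( (10**8-1)/32767 ))
3051
$ echo $(( 32767*3051 + 3050 ))
99975167

That falls short of 99999999 by about 0.025%, the price of the K=T approximation.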


Update: Hal helpfully points out several things.
  • I misspelled his name.
Oof. Sorry. Fixed. Thanks.
  • There may be even faster solutions.
if the task were to generate 10K 8-digit
random numbers, I'd just suck 80K digits out of /dev/urandom and
chop them up into 8-digit chunks. That would be considerably
faster than running my command line 10K times in a row.
Maybe so .... Is there a way to do that in the shell? It's thought-provoking and an interesting challenge.

You couldn't store the 80K digits in the code, but you could avoid storing them in a file, with the attendant I/O slowdown, by just providing them as output from a pipe.

(I'm told that in Windows, pipes are implemented with intermediate files. Not so in Unix and its offspring--the system really does hand data directly from one process to another.)

So, is there an easy way, in a shell script, to pull 8 digits at a time out of standard in?
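
Here's one possibility, sketched with bash's read -n, which grabs a fixed number of characters at a time. (A sketch only; I haven't raced it against Hal's one-liner.)

tr -dc 0-9 < /dev/urandom | head -c 80000 |
while read -r -n 8 digits
do
    echo "$digits"
done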

  • Execution efficiency isn't why you use the shell.
If the task is to generate a single 8-digit number, I claim my solution
is better from a typing perspective. :-)
Just so. Hal's exactly right.