Friday, December 18, 2009

Tar Wars

Here's a little, galactic-history lesson. When you finish it, you'll know both a new command and a portable way to copy directory hierarchies.

A long time ago, in a galaxy far, far away, there was a great struggle over which archiving tool the POSIX standard would choose: BSD's tar or AT&T's cpio. Who would rule the Empire? TAR2-D2 or 3Cpio? This mouse-and-frog battle was dubbed Tar Wars. (Cue John Williams' score.)

The winner was us. AT&T's Glenn Fowler, with Joerg Schilling, designed and implemented pax (Posix Archive eXchange), to supplant both. Pax would read and write either format, and is now on every POSIX-conforming system. Better still, on mine it's 30% smaller than tar, and less than a third the size of cpio.

Peace through unity.

A little-known-by-me feature of pax is that it will even copy directory hierarchies. Me, I use cp -a to do this job, but that's a GNU-specific idiom. Other versions of cp may lack the -a flag.

This portable command will copy olddir into newdir (newdir must exist), preserving ownerships and permissions:
$ sudo pax -pewr olddir newdir
Mnemonic: the copy is pure ("pewr").

Wednesday, December 9, 2009

Octal NUL

"Well, here's another nice mess you've gotten me into." -- Oliver Hardy

It's disappointing, but not surprising, to see edge cases behave differently in different programs. It is surprising when they're inconsistent within one program: bash.

Oh, the behaviors are standards-conforming and well-documented. Still, watch this, keeping in mind that printf and echo are shell built-ins:
# within single quotes, the shell doesn't expand metacharacters
$ echo 'a\0000b' | od -c
0000000 a \ 0 0 0 0 b \n
0000010
# but echo -e interprets \n, \nn, \nnn, and \nnnn as octal characters
$ echo -e 'a\0000b' | od -c
0000000 a \0 b \n
0000004
# printf, however, only takes \n,\nn, and \nnn as octal
$ printf 'a\0000b' | od -c
0000000 a \0 0 b
0000004
# but $'...' is interpreted as a C string,
# and in C, \0 terminates a string
$ echo -e $'a\0000b' | od -c
0000000 a \n
0000002
In the last case, it's the shell interpreting the octal string before it even gets to echo.

Want proof?
$ cat <<< 'a\0000b' | od -c
0000000 a \ 0 0 0 0 b \n
0000010
$ cat <<< $'a\0000b' | od -c
0000000 a \n
0000002
Note also that much of this quirkiness only appears when you start using four-digit, octal representations and mess around with NUL (\0). Try keeping all those details in your head, bucko!

Me, I can't. Or won't. Good thing it's all documented in the man page.

"A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines." -- Emerson

Hat Tip: I started looking at this after puzzling over a line in Hal Pomeranz's latest Command-Line Kung Fu column.

Friday, December 4, 2009

Joinery

The join command has some useful options.

Hal Pomeranz has a nice example of using join to combine the output of two different commands in this week's Command-Line Kung Fu column.

After some discussion, he ends up with this:

$ join -1 1 -2 2 <(openssl sha1 * | sed -r 's/SHA1\((.*)\)= (.*)/\1 \2/') <(wc -c *) \
| awk '{print $2 " " $1 " " $3}'

One reason he uses openssl is to help teach that in process substitution, the contents of <( ) can be pipeline. If you relax his didactic requirement, join's options let you do the job with a lot less typing.

$ join -j2 -o 1.1,0,2.1 <(sha1sum *) <(wc -c *)

Note that sha1sum has the same output format as wc -c, which makes the join easier.

Non-Linux boxes might not have sha1sum, but if I didn't have it, I'd see if I had md5sum, which has the same output format. Their Wikipedia entries say these commands are widely available on lots of non-Linux OSs.

Tuesday, November 24, 2009

My Phone Starts Giving Me Things I Never Had

This morning, my Android phone used Google's Turn-by-Turn navigation to coach me into work.

I'd wanted to play with voice navigation every since I got my Android phone, a few months ago, but all the available apps were pay services, so I waited. A few weeks ago, Google announced they'd provide it with Android 2.0, but my HTC Magic (T-Mobile myTouch) was still running 1.6.

Picture me, tapping my toe impatiently.

Yesterday, they back-ported the functionality to 1.6 and put it in the app store, and I grabbed it. This morning, I put in my destination, sat it in my lap, and set out.

Hearing it tell me directions as I drive is startling.

In one-quarter mile, turn right onto on-ramp, Highway 157, North.

Turn right onto on-ramp, Highway 157, North.

...

My father's Bell model 500 sits on my desk. When I got my first cell phone, I couldn't believe I was carrying a telephone in my pocket.

I shut off my land-line service, and jumped in with both feet. I quickly discovered I also had my watch, my alarm clock, and my pager in my pocket. My old pager was long gone, but I donated my alarm clock to the Salvation Army. My watch, a gift from an old girlfriend, went in a drawer.

The myTouch put my email reader in my pocket, too. To underscore this, Qwest, my DSL provider, had a service outage a week or so after I got it. No computer, but I could still read my email.

Having both persuaded me that I should take my Google Contacts seriously, so it now really does store all my email addresses and my phone numbers. The app that stores my library-card, my grocery-card, and other bar codes was also useful.

But all that was extensions to things I already had. The bubble level app was cute, but not important. Google Sky Maps was very, very cute, but not important.

Other stuff? Nice, if not quite ready for prime-time. Browser? Slooooow. Calendar? Too small to be useful. eBook reader? I'll let you know once there's a Kindle app.

Amazon Marketplace showed me the future of retail, but it's still mostly an "Oooh!"

The turn-by-turn directions though? Important.

I have a notebook full of maps I've printed to get places -- my own, personal atlas. No more. From now on, my phone can just tell me how to get where I want to go.

The phone in my pocket has replaced the GPS that I never owned.

Tuesday, November 17, 2009

Translate-As-You-Type

I'm interested in translation tools. Google Translate now does translate-as-you-type, instead of waiting for you to finish typing and submitting the result.

It's claimed that it will also do phonetic translation of non-Roman scripts, but I'm not seeing that with Yiddish.

Thursday, November 12, 2009

The "help" Command, And Other Aids

In his latest Command-Line Kung Fu post, Hal Pomeranz talks about places to get help on Unix systems and their kin. (Technically, that's "their latest post," but his co-conspirators write about Microsoft stuff, which I don't care about.)

Hal concentrates on man(1), info(1), and apropos(1), and the --help flag. Good places to start, but there are other things worth mentioning.

So I will.

locate(1) is quite handy. I often find myself hunting for commands that don't have man entries of any kind, and are installed someplace bizarre. Once I find them, /path/to/command --help often tells me enough. If not, but they're in some special directory, there's sometimes a README or examples that do the trick.

Another frequent stop is the web. If command --help doesn't tell me what I want to know, the web's more likely to have a man page than my machine is. I'll now usually go there first.

If it's not an executable? How about type?
$ type ls
ls is aliased to `ls --color=auto'
$ type cdjob
cdjob is a function
cdjob ()
{
local d;
: ${1:?"usage $FUNCNAME %N"};
d=$(jobs $1 | perl -lane 'print "cd $1" if m/.*\(wd: (.*)\).*/');
test "$d" && eval $d
}
And what's type?
$ type type
type is a shell builtin
For information about type, do we have to pour through the zillion-page, bash man page? On-line?

Nope. Watch:
$ man type
No manual entry for type
$ type type
type is a shell builtin
$ help type
type: type [-afptP] name [name ...]
Display information about command type.
For each NAME, indicate how it would be interpreted if used as a
command name.
Options:
-a display all locations containing an executable named NAME;
includes aliases, builtins, and functions, if and only if
the `-p' option is not also used
-f suppress shell function lookup
-P force a PATH search for each NAME, even if it is an alias,
builtin, or function, and returns the name of the disk file
that would be executed
-p returns either the name of the disk file that would be executed,
or nothing if `type -t NAME' would not return `file'.
-t output a single word which is one of `alias', `keyword',
`function', `builtin', `file' or `', if NAME is an alias, shell
reserved word, shell function, shell builtin, disk file, or not
found, respectively
Arguments:
NAME Command name to be interpreted.
Exit Status:
Returns success if all of the NAMEs are found; fails if any are not found.
typeset: typeset [-aAfFilrtux] [-p] name[=value] ...
Set variable values and attributes.
Obsolete. See `help declare'.
Yes, there's really a help command. It gives help about shell builtins.

If I don't even have the command? (If, for example, it's part of someone else's script I've pulled down from somewhere?)

Here's what I get out of bash when I invoke als(1) on my Ubuntu box.
$ als
The program 'als' is currently not installed. You can install it by typing:
sudo apt-get install atool
als: command not found
Not help with the command, okay, but help getting it so I can then get help with it.

What would you pay? But wait. There's more.

If I guess at a command but mis-type it?
$ las
No command 'las' found, did you mean:
Command 'als' from package 'atool' (universe)
Command 'ls' from package 'coreutils' (main)
Command 'lfs' from package 'lustre-utils' (universe)
Command 'lms' from package 'lms' (universe)
Command 'les' from package 'atm-tools' (universe)
Command 'last' from package 'sysvinit-utils' (main)
Command 'laps' from package 'epix1' (universe)
Command 'lvs' from package 'lvm2' (main)
Command 'cas' from package 'amule-adunanza-utils' (universe)
Command 'cas' from package 'amule-utils' (universe)
Command 'as' from package 'binutils' (main)
Command 'ras' from package 'ras' (universe)
Command 'kas' from package 'openafs-kpasswd' (universe)
Command 'lat' from package 'lat' (universe)
las: command not found
I think that's nice. And I can use apt-cache show to tell me what each package is.




Wednesday, November 11, 2009

Valour-IT Fundraiser




If you're geeky enough to be reading this blog, you're geeky enough to appreciate the idea of a project that provides computers to wounded vets.


Monday, November 9, 2009

Checking Scripts for Syntax Errors

If I'm maintaining a lot of shell scripts in a directory, I usually want to syntax-check them before I check them in.

The file(1) command will mark most of them as shell scripts. The rest are, typically, libraries of shell functions that I name with .sh suffixes.

All shell scripts that file can find:

$ file * | awk -F: '/shell/{print $1}'

plus the libraries

$ file * | awk -F: '/shell/{print $1}' ; echo *.sh

now syntax check them

$ for i in $(file * | awk -F: '/shell/{print $1}' ; echo *.sh); do bash -n $i || echo $i fails; done

Just with command-line recall and editing.


Friday, October 23, 2009

Google's Bulgarian Picture Dictionary

Language translation is serious business, but playing with automatic translation tools is just fun.

I use Google Translate a lot, along with their translation bots. I've played with the Word Monkey gadget for iGoogle.

I've even built spreadsheets to do translation with Google's spreadsheet functions; I'd bet even beginning corpus-linguistics students could do some cool experiments with these.

Today, my pal, Kevin Cohen, put up this Google Chat status message:
My favorite Bulgarian word is хляб. What's yours?
I put it into Google Image Search, to see what would happen, and got this.

I think I can guess what хляб means.

Thursday, October 22, 2009

A GUI Replacement for sshfs

I've been using sshfs to remote-mount my home machine onto my work machine. Now, I've stumbled on an alternative that feels better integrated with my Gnome desktop: gvfs (gnome virtual file system) in Nautilus.

The short version:
$ nautilus sftp://test.com
brings up a file-browser window with test.com in it, which I can then browse and click around in. Also, test.com is mounted as ~/.gvfs/sftp on test.com/

Yes, it has all those blanks. Oh well. You can still get to it on the command line and in scripts.

If you already have a file-browser window up, just use the URI sftp://test.com

If you need to go in through a different port from 22, say 666, you'll first need to add a pair of lines to your .ssh/config file that look like like this:
Host test.com
Port 666
You can test whether they work or not by using ssh by hand.
$ ssh test.com

Monday, October 19, 2009

Bugs

All programs have bugs. A "mature program" is a program with more obscure bugs.

When I first started programming, I was convinced that half my programs had uncovered bugs in the compiler.

They never were. "Oh. 'Missing semicolon on line 71' goes away if I put a semicolon in."

Because compilers are more mature than my code, the probability is higher that bugs are mine.

When a shell script I wrote last week failed, I began carving it down to see what I'd done wrong. This is usually fast and easy--chop pieces out of the script until you can get a very simplified statement that doesn't do what I thought it should do, then go back to read the man page to see why.

Here's the simplified test case I ended up with:
set -x # this is required
PS4='$(true)$ ' # must be a command in the prompt, but any command will do

false || A=3 # first expression must fail, second must be a variable assignment
echo $? # Should be 0, but isn't. Odd.
Much to my surprise, it's bug in bash, both versions 3 and 4.

Obscure? Sure. You have to have a lot of special things going on, and the symptom is a bad exit code. I reported it and used a workaround (if .. fi instead of the shortcut logical or).

Always remember: every program has bugs. They're only almost always yours.

Saturday, October 3, 2009

Hello, World Again

I'd like to try saying "We sometimes take the shell too much for granted," another way.

Advice from masters is often good advice.

Here's my favorite chunk of the greatest of all programming texts, Kernighan and Ritchie's The C Programming Language (Prentice-Hall, 1978).

1.1 Getting Started

The only way to learn a new programming language is by writing programs in it. The first program to write is the same for all languages:

Print the words
hello, world
This is the basic hurdle; to leap over it you have to be able to create the program text somewhere, compile it successfully, load it, run it, and find out where your output went. With these mechanical details mastered, everything else is comparatively easy.

In C, the program to print "hello, world" is
#include
main()
{
printf("hello, world\n");
}

Just how to run this program depends on the system you are using. As a specific example, on the UNIX operating system you must create the program in a file whose name ends in ".c", such as hello.c, then compile it with the command
cc hello.c
If you haven't botched anything, such as omitting a character or misspelling something, the compilation will proceed silently, and make an executable file called a.out. Running that by the command
a.out
will produce
hello, world
as its output. On other systems, the rules will be different; check with a local expert.

Exercise 1-1. Run this program on your system. Experiment with leaving out parts of the program, to see what error messages you get.

Fine advice. Let's do exercise 1-1, but in the shell.

Run the analogous program? Okay.
$ echo hello, world
hello, world
Leave out parts? Let's leave out a part of the string.
$ echo hell, world
hell, world
Or part of the command.
$ eco hello, world
bash: eco: command not found
How about whitespace? It's okay to leave it out of the string,
$ echo hello,world
hello,world
but you need some between a command and its arguments.
$ echohello, world
bash: echohello,: command not found
My point: Programing in the shell is quick and easy. You just type. There's no editing, no special naming, no compiling, no a.out file, no loading and running, no need to consult a local expert.

If you type something incomprehensible, the shell gives you an error message and lets you try again, right away.

Thursday, October 1, 2009

The Shell Enters a Beauty Contest

I'd never tout the shell as the be-all and end-all of programming languages, but it gets less attention and respect than it deserves.

For example, folks will remark, casually, that shell syntax is ugly. Who would design a language that doesn't even let you put spaces around the '=' in an assignment?
$ x=3
$ y = 3
-bash: y: command not found
$ z= 5
-bash: 5: command not found
Eeew. Real programs are in C. Or Perl. Or Python. Or Haskell. Or ...

Yep, the shell syntax has some design flaws all right. But let's run another beauty contest.

First, contestant #1:
#include <unistd.h>
#include <stdlib.h>

int main(void)
{
int     fd[2], nbytes;
pid_t   pid;

pipe(fd);

if ((pid = fork()) == 0) {
  dup2(fd[0], 0);
  close(fd[1]);
  execlp("/bin/grep", "/bin/grep", "^z", NULL);
} else {
  dup2(fd[1], 1);
  close(fd[0]);
  execlp("/bin/ls", "/bin/ls", "-1", "/bin", NULL);
  wait(NULL);
  exit(0);
}

return(0);
}

Next, contestant #2:

ls /bin | grep ^z
And contestant #1 doesn't even have normal error checking, which would make it much longer, uglier, and hard-to-follow.

Programmers get so used to the shell that they focus on its flaws, but take its virtues for granted.

Don't it always seem to go that you don't know what you've got till it's gone? -- Joni Mitchell
What tastes of paradise does the shell offer besides pipes? I/O redirection. Ease of process creation. Multi-process programming. Parallelism. Command-line editing. For that matter, the entire idea of a CLI, a "command-line interface."

Once I start listing things, it's hard to stop.

All this in a language you can use in scripts, or just by doing nothing harder than typing at a prompt.

Wednesday, September 30, 2009

Grobycs: Cyborgs in Reverse

We've read about cyborgs, computer-enhanced humans, for decades. Misdirection? We should have been thinking about grobycs: human-enhanced computers. Cyborgs in reverse.

Consider these:
All use people as a piece of a collection of software. They're cyborgs in reverse: Grobycs.

The Six-Million-Dollar Man wanted her:

What do you think grobycs will lust after?

Wednesday, September 23, 2009

Variables Without Values

Often, you only want to do something if a variable is unset.

I see code to handle such conditions that looks like this.
if [ "A$foo" = "A" ]
then
take-some-action
fi
It's either old code or code by old programmers.

The test is how you used to have to ask whether a variable was empty. Here's a newer idiom.

[ "$foo" ] || take-some-action
The test operator, [ ], comes in many flavors. For example
[ -d $X ] asks "Does $X name a directory?"

In its simplest form, though,

[ $X ]
asks "Is $X empty?" The test returns true if $X has something in it, false if it doesn't. The new code does the same thing as the old, but it's shorter and, arguably, easier to read.

But why not this?

[ $foo ] || take-some-action
Ah. Because if foo='1 2 3', then test complains that you've given it too many arguments.

[: 2: unary operator expected
One more trick: what if you're running with "set -u", which complains whenever it stumbles on an unset variable, with complaints like this
line 3: foo: unbound variable

Write the test like this:
["${foo-}" ] || take some action
If $foo is unset, ${foo-} has a null value. "Null" is a value; "unset" really means the variable was never set. So, with "${foo-}" the shell won't complain about an unset variable, but the test will still fail.

This is still shorter than the example we started with, but it's only more readable once you can read the idiom ${foo-} without stumbling.

On the other hand, the first example will also choke if $foo is unset, so it would also have to be modified like this:
if [ "A${foo-}" = "A" ]
then
take-some-action
fi

More Include Guards

I've already written about how to make include guards for "shell libraries" (files full of shell functions or variables to be sourced).

This is the basic form of the statement, which requires tailoring for each library:
if [ ${_gripe_version:-0} -gt 0 ]
then
return 0
else
_gripe_version=1
fi
Here's a more generic statement that you can put at the top of a file to achieve the same goal:
[ "${_libs/${BASH_SOURCE[0]}}" = "${_libs=}" ] ||
return 0 && _libs+=" ${BASH_SOURCE[0]}"
A dramatic reading of this off-putting code is left as an exercise to the reader.

It has the added attraction that you can, at any point, find out which libraries you've already sourced with echo $_libs .

It has the added detraction that you'll have to cut-and-paste it. There's no way you'll ever keep it in your head and type it correctly.

As a good test, first try this infinite recursion without include guards. (You'll have to ^C out quickly, or you'll get stuck in a source-a-thon.)
$ echo 'source foo.sh' > foo.sh
$ chmod +x foo.sh
$ ./foo.sh # don't wait long to ^C
Next, edit foo.sh to add an include guard at the top and re-run it. This time, it will return immediately, without help.

Friday, September 18, 2009

Simplifying Loopy Code


The determined Real Programmer can write FORTRAN programs in any language.
I just read through a friend's shell script. He doesn't program in the shell much, but he's a superb C programmer, so his script has loops that look like this:
file[0]=foo
file[1]=bar
file[2]=mumble
suffix[0]=.c
suffix[1]=.h
suffix[2]=.txt

nfiles=${#file[@]}
nsuffixes=${#suffix[@]}


i=0while [ $i -lt $nfiles ]
do
j=0
while [ $j -lt $nsuffixes ]
do
process ${file[$i]}${suffix[$j]}
(( j = j + 1 ))
done
(( i = i + 1 ))
done
Yep. That'll work. But this will, too.
for filename in {foo,bar,mumble}.{c,h,txt}
do

process $filename
done
Bash sees filenames as strings, and most shell commands accept space-separated lists of filenames. The shell handles anything that expands to a list of filenames, like globs or brace expansions, with real ease.

Use the Shell, Luke.

Wednesday, September 16, 2009

Call Me Now

Click on the Google Call Me button, below, fill in your own name and phone number, and then click Connect.

You'll be connected to me immediately.



Saturday, September 5, 2009

But *How* Random Is It?

Folks use /dev/urandom and $RANDOM to generate uniformly-distributed random numbers. Are they random? Yes.

Let's test. Here's how.

Surprise! The easiest way to start is to ask a more general question: Suppose you have a couple of sets of numbers. Are they from the same distribution?

For example, if you have a couple of batches of 10,000, 8-digit numbers, from the two different distributions, are they really spread out from 0 to 9999 the same way, or do they have clumps in different places?

There are ways to attack this, even if you don't know (or care) what the original distribution was. Here's a step-by-step walkthrough:
  • Sort them together, smallest to biggest, coloring the first set red and the second, green.
  • Put your finger on the middle of a line--zero on the number line--and begin reading off colors. If the number's red, move your finger left. If it's green, move it right.
  • See how far away from the origin you wander.
Half the numbers are red, half green, so you'll end up back where you started.

How far away might you get along the way? Depends on the distribution of the reds and greens. If all the reds are smaller than all the greens, you'll go left to ten thousand, then turn around and come right back. If they're all bigger, then you'll get ten thousand away to the right before you snap back like a yo-yo. And except for these cases, you won't go as far before you return.

If the numbers are from the same distribution, then whether they're red or green is a coin-toss. The distance I expect to get from the the origin goes up as the square-root of the size of my collections.

(That's on average -- any single pair of batches could end up almost anywhere. "... the 'one chance in a million' will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be ...." -- R. A. Fisher)

There are lots of ways to get this result:
  1. from the standard deviation of a binomial with p=0.5
  2. from a two-sample Kolmogorov-Smirnov statistic with equal sample sizes
  3. from looking at it as Brownian motion, using the root-mean-square (RMS) argument attributed to Einstein, in Feynman, volume 1, chapter 6
All roads lead to Rome.
Note: nothing I've said, so far, depends on the numbers being random--this just tests whether the red and green numbers come from the same distribution. In what follows, though, I'm going to ask whether batches spit out by random-number generators are uniformly-distributed.
Let's write code. And, since all code roads lead to Rome, too, I'll do it in the shell.

Start with two, random-number-generators, R1 and R2 that generate some fixed-size batch of random numbers.
  • Check that each one spits out the same sized batch.
$ R1 | wc -l ; R2 | wc -l
  • Mark each output with a second field, saying which way each number will move my finger
$ R1 | awk '$2=1'; R2 | awk '$2=-1'
  • Sort the two together, on the first column
$ sort -n <(R1 | awk '$2=1') <(R2 | awk '$2=-1')
  • Run through the output, keeping track, at each step, of how far I've gotten away from the origin
$ sort -n <(R1 | awk '$2=1') <(R2 | awk '$2=-1') | awk 'pos+=$2; d = ( pos <>
Notice how I'm just recalling the last command and editing it. All this is on the command line, where I can watch the effect of each step.
  • Print just the the biggest distance I get from the origin, for the whole trip

$ sort -n <(R1 | awk '$2=1') <(R2 | awk '$2=-1') | awk 'pos+=$2; d = ( pos > 0 ) ? -pos : pos; print d' | sort -n | tail -1
Easy enough. I'll try it first with a script that uses the Halgorithm:
#!/bin/bash

head -10000 /dev/urandom | tr -dc 0-9 | perl -pe 's/(.{8})/\1\n/g' | head -${1:-10000}
Whatever argument I give it on the command line tells it how many eight-digit integers to generate. [Default: 10,000]

Two runs should produce different batches, but the same distribution. I bring back the earlier command and do half a dozen runs with 100 numbers from each batch.
$ for i in {1..6}; do sort -n <(R1 100 | awk '$2=1') <(R1 100 | awk '$2=-1') | awk 'pos+=$2; d = ( pos > 0 ) ? -pos : pos; print d' | sort -n | tail -1 ; done
13
20
6
7
10
10
The average? 10.6 .

(Careful! I'm comparing the batch produced by one run of the generator with the run produced by a different batch. It tells me what to expect for different batches from the same distribution.)

More runs, a hundred instead of a dozen, just gives a more accurate average: 12.2

If it varies with the square of the size of the batches, batches of 10,000 should give numbers nearer to 120.
$ for i in {1..6}; do sort -n <(R1 | awk '">=1') <(R1 | awk '">=-1') | awk 'pos+=">; d = ( pos > 0 ) ? -pos : pos; print d' | sort -n | tail -1 ; done

90
97
71
91
105
127

Good. The average value, over 100 paired runs, is about 125.

This, finally, lets me compare Hal's method with the shell-only method from an earlier post, which uses $RANDOM? Does $RANDOM have a different distribution from /dev/urandom?

Nope. The average distance is, again, about 125. If they're non-uniform, they're non-uniform in the same way.

But maybe neither is random. What if I compare them with a batch of uniformly-distributed, true random integers, downloaded from the web?

About 135. A little higher, but not high enough to set off a panic alarm. $RANDOM and /dev/urandom both give pseudo-random-numbers that are, roughly, uniformly distributed.

Thursday, September 3, 2009

Another Random Post

Hal Pomeranz suggests this one-liner
head /dev/urandom | tr -dc 0-9 | sed -r 's/(.{8})/\1\n/g'
for generating big batches of random numbers. It's a winner: reasonable performance and easy-to-type. Timing information follows:
#!/bin/bash

echo == Shell arithmetic
(( R = 2**15-1, T = 10**8-1, C = T/R ))
readonly R T C

time for i in {0..10000}
do
printf "%8.8u\n" $(( RANDOM*C + ( RANDOM%C ) ))
done > /dev/null

echo; echo == Pipeline
time head -10000 /dev/urandom | tr -dc 0-9 | sed -r 's/(.{8})/\1\n/g' | head -10000 > /dev/null
Here's the numbers:

== Shell arithmetic

real 0m0.262s
user 0m0.256s
sys 0m0.004s

== Pipeline

real 0m0.421s
user 0m0.028s
sys 0m0.392s
Okay, the pipe's a little slower, but it's the same order of magnitude, and way easier to type.

Wednesday, September 2, 2009

The International Slide Rule Museum

From Mike Rosenlof:
They have (who am I kidding! HE has..) an exhibit at the Louisville Public Library.

http://sliderulemuseum.com/

Tuesday, September 1, 2009

Another Challenge From Hal & Ed: Random Numbers In the Shell

Hal Pomeranz and Ed Skoudis have a weekly blog, Command Line Kung Fu, where they set and solve interesting problems in both the Unix/Linux shell and Windows's CLI, COMMAND.COM.

This week's was how to generate random 8-digit integers.

Obligatory pedantic sidebar, which you can skip:

"Random" can mean a lot of things. If I flip a coin, and announce either "00000000" or "00000001", that's an 8-digit, random number. Ho hum. What I'm looking for here is integers uniformly distributed in the interval 0000000..99999999. I'll write this as U(0,10^8-1).
The problem is trickier than you might think. For example, the shell's random-number generator, $RANDOM, is U(0,2^15-1). Not big enough.

How about, say, summing calls to $RANDOM until the number is big enough?

Nope. Adding a bunch of uniformly-distributed numbers gives a random number, all right. But it isn't uniformly distributed anymore. It's a clumped-up, bell curve.

Hal explores several routes, culminating with a good one-liner:

$ head /dev/urandom | tr -dc 0-9 | cut -c1-8
But is there a way with the shell's own built-ins?

Yep. The answer's below. What's more, because this way doesn't call outside utilities or fork subshells, it should, in theory, be faster. And it is.

How much faster?

On my machine, the one-liner above generates 10,000 random numbers in just under 40 seconds. The code below does the same 10,000 in about a quarter of a second.

I've written it out as a script, with a dramatic reading in the comments, but if I were using it in something else, I'd ditch the big comments and put the code in-line.
#!/bin/bash
## Generate a uniformly distributed,
## 8-digit, random number: U(0,99999999)

## First, the logic

# (1) Get a uniformly-distributed random number,
# N = $RANDOM, up to R = 2^15-1 = 32767.
# That's U(0,R)

# (2) Stretch out the interval covered,
# up to nearly T = 10^8-1
# by multiplying by C = K/R.
# (We'll figure out K in a minute)
# This turns 0,1, 2, 3, ... into 0, C*1, C*2, ... K

# (3) Now add a random shift S = U(0,C-1)
# to turn these into 0, 0+1, 0+2, ... 0+C-1,
# C*1+0, C*1+1, ..., K+0, K+1, ... K+C-1.
# We can get S from another random number M = $RANDOM % C
# (% is the 'mod' operator)

# What's K? We want K+C-1 = T = 10^8-1, so we solve:
# K+C-1 = K+(K/R)-1 = 10^8-1; K[ (R+1)/R ] = 10^8;
# K = (10^8)*R/(R+1)

# For big R, that's so close to T
# that I'll just use K=T, C=T/R
# Our U(0,10^8-1) number is N*C + M

## Next, the code

# calculate the constants
(( R = 2**15-1, T = 10**8-1, C = T/R ))
readonly R T C

# do the calculation, and print the result out
# with 8, full digits.
printf "%8.8u\n" $(( RANDOM*C + ( RANDOM%C ) ))
Nothing up my sleeve ... Presto! Not bad, for a shell.


Update: Hal helpfully points out several things.
  • I misspelled his name.
Oof. Sorry. Fixed. Thanks.
  • There may be even faster solutions.
if the task were to generate 10K 8-digit
random numbers, I'd just suck 80K digits out of /dev/urandom and
chop them up into 8-digit chunks. That would be considerably
faster than running my command line 10K times in a row.
Maybe so .... Is there a way to do that in the shell? It's thought-provoking and an interesting challenge.

You couldn't store the 80K digits in the code, but you could keep from having to store them in a file, and the attendant I/O slowdown, by just providing them as output from a pipe.

(I'm told that in Windows, pipes are implemented with intermediate files. Not so in Unix and its offspring--the system really does hand data directly from one process to another.)

So, is there an easy way, in a shell script, to pull 8 digits at a time out of standard in?

  • Execution efficiency isn't why you use the shell.
If the task is to generate a single 8-digit number, I claim my solution
is better from a typing perspective. :-)
Just so. Hal's exactly right.