Friday, December 18, 2009

Tar Wars

Here's a little galactic-history lesson. When you finish it, you'll know both a new command and a portable way to copy directory hierarchies.

A long time ago, in a galaxy far, far away, there was a great struggle over which archiving tool the POSIX standard would choose: BSD's tar or AT&T's cpio. Who would rule the Empire? TAR2-D2 or 3Cpio? This mouse-and-frog battle was dubbed Tar Wars. (Cue John Williams' score.)

The winner was us. AT&T's Glenn Fowler, with Joerg Schilling, designed and implemented pax (portable archive interchange) to supplant both. Pax would read and write either format, and is now on every POSIX-conforming system. Better still, on mine it's 30% smaller than tar, and less than a third the size of cpio.

Peace through unity.

A little-known-by-me feature of pax is that it will even copy directory hierarchies. Me, I use cp -a to do this job, but that's a GNU-specific idiom. Other versions of cp may lack the -a flag.

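This portable command will copy olddir into newdir (newdir must exist), preserving ownerships and permissions:
$ sudo pax -rw -pe olddir newdir
Mnemonic: read, write, preserve everything. (Keep -pe on its own: -p takes an argument, so bundling the flags as -pewr would hide -w and -r from pax.)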
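If you'd like to try this without sudo, here's a minimal sketch that builds a throwaway tree and copies it. It assumes pax is installed (plenty of Linux boxes don't ship it by default), and the directory and file names are made up for the demo. It spells the options out separately: -r and -w together select copy mode, and -p e asks pax to preserve everything it can.

```shell
# Assumes pax is installed; all names below are throwaway examples.
if command -v pax >/dev/null 2>&1; then
    work=$(mktemp -d)
    mkdir -p "$work/olddir/sub" "$work/newdir"
    echo hello > "$work/olddir/sub/file.txt"
    chmod 640 "$work/olddir/sub/file.txt"

    # -r and -w together mean "copy"; -p e preserves everything pax can.
    # cd first, so the copy lands at newdir/olddir instead of dragging
    # the whole temporary path along with it.
    (cd "$work" && pax -rw -pe olddir newdir)

    ls -l "$work/newdir/olddir/sub"
    copied=$(cat "$work/newdir/olddir/sub/file.txt")
    rm -rf "$work"
else
    echo "pax is not installed on this box" >&2
    copied=skipped
fi
```

Without root, pax can only preserve ownership for files you already own, which is why the real command above runs under sudo.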

Wednesday, December 9, 2009

Octal NUL

"Well, here's another nice mess you've gotten me into." -- Oliver Hardy

It's disappointing, but not surprising, to see edge cases behave differently in different programs. It is surprising when they're inconsistent within one program: bash.

Oh, the behaviors are standards-conforming and well-documented. Still, watch this, keeping in mind that printf and echo are shell built-ins:
# within single quotes, the shell doesn't expand metacharacters
$ echo 'a\0000b' | od -c
0000000 a \ 0 0 0 0 b \n
0000010
# but echo -e treats \0 plus up to three octal digits (\0nnn) as one character
$ echo -e 'a\0000b' | od -c
0000000 a \0 b \n
0000004
# printf, however, takes at most three octal digits after the backslash (\nnn)
$ printf 'a\0000b' | od -c
0000000 a \0 0 b
0000004
# but $'...' is interpreted as a C string,
# and in C, \0 terminates a string
$ echo -e $'a\0000b' | od -c
0000000 a \n
0000002
In the last case, it's the shell interpreting the octal string before it even gets to echo.

Want proof?
$ cat <<< 'a\0000b' | od -c
0000000 a \ 0 0 0 0 b \n
0000010
$ cat <<< $'a\0000b' | od -c
0000000 a \n
0000002
Note also that much of this quirkiness only appears when you start using four-digit, octal representations and mess around with NUL (\0). Try keeping all those details in your head, bucko!
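One way to see the two octal rules side by side without bringing echo into it: printf follows the three-digit rule in its format string, but, as far as I can tell from the documentation, follows echo's \0-plus-three-digits rule for a %b argument. Here's a small sketch that just counts bytes; byte_count is a helper I made up for the demo.

```shell
# byte_count: number of bytes on stdin, minus wc's padding
byte_count() { wc -c | tr -d ' '; }

# In the format string, \000 is NUL and the fourth 0 is a literal digit.
fmt_bytes=$(printf 'a\0000b' | byte_count)

# In a %b argument, \0 may be followed by up to three octal digits,
# so all of \0000 collapses into one NUL.
arg_bytes=$(printf '%b' 'a\0000b' | byte_count)

echo "format string: $fmt_bytes bytes"
echo "%b argument:   $arg_bytes bytes"
```

In bash, the first should report 4 bytes (a, NUL, 0, b) and the second 3 (a, NUL, b).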

Me, I can't. Or won't. Good thing it's all documented in the man page.

"A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines." -- Emerson

Hat Tip: I started looking at this after puzzling over a line in Hal Pomeranz's latest Command-Line Kung Fu column.

Friday, December 4, 2009

Joinery

The join command has some useful options.

Hal Pomeranz has a nice example of using join to combine the output of two different commands in this week's Command-Line Kung Fu column.

After some discussion, he ends up with this:

$ join -1 1 -2 2 <(openssl sha1 * | sed -r 's/SHA1\((.*)\)= (.*)/\1 \2/') <(wc -c *) \
| awk '{print $2 " " $1 " " $3}'

One reason he uses openssl is to help teach that in process substitution, the contents of <( ) can be a pipeline. If you relax his didactic requirement, join's options let you do the job with a lot less typing.

$ join -j2 -o 1.1,0,2.1 <(sha1sum *) <(wc -c *)

Note that sha1sum has the same output format as wc -c, which makes the join easier.
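To see what the join options are doing without hashing anything, here's a sketch with two hand-made listings in the same value-then-filename layout. The "hashes" and sizes are placeholders I invented, not real output. It uses -1 2 -2 2, the POSIX spelling of -j2, and in the -o list, 0 stands for the join field.

```shell
work=$(mktemp -d)

# Two "value name" listings, both sorted on the name in field 2.
# The values are invented placeholders, not real checksums or sizes.
printf '%s\n' 'aaa111 alpha.txt' 'bbb222 beta.txt' > "$work/sums"
printf '%s\n' '12 alpha.txt' '34 beta.txt' > "$work/sizes"

# Join on field 2 of each file; print file 1's value, the join field
# (the name), then file 2's value.
joined=$(join -1 2 -2 2 -o 1.1,0,2.1 "$work/sums" "$work/sizes")
printf '%s\n' "$joined"

rm -rf "$work"
```

Since join splits fields on blanks, filenames containing spaces would confuse both this sketch and the one-liner above.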

Non-Linux boxes might not have sha1sum, but if I didn't have it, I'd check for md5sum, which has the same output format. Their Wikipedia entries say both commands are available on many non-Linux OSs.
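That "see what I have" step is easy to script. A rough sketch follows; pick_hasher is a name I made up, and it only knows about the two tools mentioned here.

```shell
# pick_hasher: print the first checksum tool found on PATH.
pick_hasher() {
    for tool in sha1sum md5sum; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            return 0
        fi
    done
    return 1
}

if hasher=$(pick_hasher); then
    echo "using $hasher"
else
    hasher=none
    echo "neither sha1sum nor md5sum found" >&2
fi
```

In bash, whichever tool it finds can then be dropped into the process substitution as <("$hasher" *).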