Friday, August 28, 2009

Hash Collisions in Git

This post would be easier if I knew how to post formulas in Blogger. If you know, tell me.

In the Pro Git book, Scott Chacon notes that for the probability of a SHA1-hash collision to rise to 1/2, you need about 10^24 objects (more, he says, than the number of grains of sand on earth).

Where does this number come from? It's a classic birthday problem. If we have K people, with birthdays on any of 365 days, what are the odds of two or more people having the same birthday?

P(same) = 1-P(everyone's different).

To assign different birthdays to K different people, we can give the first person any of 365 birthdays, the next any of the remaining 364, and so on. The Kth person has a choice of 365-K+1 days.

Turning this into a probability means dividing by 365^K -- the number of different ways to assign birthdays to K folks if you don't care about collisions.

But what's that product? Just multiply it out:

X = 365(365-1)(365-2) ... (365-K+1) = 365^K - [1+2+...+(K-1)]365^(K-1) + something,

where something is O(365^(K-2)). (That's big-Oh notation, if it's not clear.) And 1+2+...+(K-1) is just K(K-1)/2.

Dividing by 365^K gives, approximately, P(no collision) = 1-[K(K-1)/2]/365,
or P(collision) = (1/365)K(K-1)/2 .

For SHA1 sums, we replace 365 by the number of 160-bit SHA1 values: 2^160.

How big does K have to be for (1/2^160)K(K-1)/2 to be about 1/2? About
K ~ sqrt(2^160) = 2^80, or about 10^24!
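Since I can't typeset in Blogger, here's the whole derivation again as plain LaTeX, for anyone who wants to render it:

```latex
P(\text{no collision}) = \frac{365 \cdot 364 \cdots (365 - K + 1)}{365^K}
                       \approx 1 - \frac{K(K-1)}{2 \cdot 365}

P(\text{collision}) \approx \frac{K(K-1)}{2 \cdot 365}

% For N equally likely values (N = 365 for birthdays, N = 2^{160} for SHA1):
P(\text{collision}) \approx \frac{K(K-1)}{2N} \approx \frac{1}{2}
    \quad\text{when}\quad K \approx \sqrt{N}
```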

Another quick application of this idea comes from the fact that the SHA1 prefix git now requires to specify unique kernel objects is 10 hex digits long. Ten hex digits is 40 bits, so we can guess the database now has something on the order of sqrt(2^40) = 2^20, or about a million, distinct objects.

And then there's the original birthday party question. If you have sqrt(365) people in the room -- that's about 19 -- you have about a 50-50 chance that two will have the same birthday.
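A quick sanity check of both numbers, using awk for the floating-point arithmetic:

```shell
# 2^80 really is on the order of 10^24 ...
awk 'BEGIN { printf "%.3g\n", 2^80 }'               # 1.21e+24
# ... and K = 19 people give roughly even odds of a shared birthday:
awk 'BEGIN { printf "%.2f\n", 19 * 18 / (2 * 365) }'  # 0.47
```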

This would, however, have been clearer if I could have stuck real formulae in the post.

Using Git Stash as a Snapshotter

I use git stash to snapshot my work.

I even use it to snapshot my home directory. I have a cron job that periodically does, basically, this:
git stash
git stash apply
to squirrel away any work I haven't checked in yet. (git stash saves the uncommitted work; git stash apply immediately restores it, so the working directory is left undisturbed.)


Until today, it did exactly that. But a few days ago, a Makefile change started getting undone: my old version kept coming back up like bad Mexican food. Hmm.

The bad (old) version was in my stash. What would happen was that git stash would run, figure out I had no recent changes, and do nothing. Then git stash apply would run and restore the last un-checked-in code in the stash (my bad Makefile).


Now, the code says this:
git stash | grep -q 'No local changes to save' ||
git stash apply
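The whole cron job body, as a sketch (the function name and repo-path argument are mine, not part of the original):

```shell
# snapshot: stash any uncommitted work in the repo at $1, then put it
# right back, so the stash holds a snapshot but the working tree is
# untouched. If there was nothing to stash, skip the apply -- otherwise
# an old stash entry would clobber current files.
snapshot() {
    cd "$1" || return 1
    git stash | grep -q 'No local changes to save' ||
        git stash apply
}
```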

Wednesday, August 26, 2009

Remember When?

My uncle used to send "Remember when?" email to giant lists of people.

You've seen them from folks who are feeling old. "Remember when penny candy cost a penny?" ... things like that. Mostly, my answer was, "No."

I was walking back from lunch with David Russell, today, when a woman stopped us to ask for directions. I said, "I'd look on Google Maps, but I left my phone at home."

Then I thought about that for a second, and said, "Remember when we always left all our phones at home?"

Remember when people used to send email to giant mailing lists instead of blogging or Tweeting? You're probably answering, "No."

Are We There, Yet? Using fuser to See If The Job's Done.

Ever write a script that ends before the things in it should?

The symptom is a file that gets partially written or partially copied. Sometimes the output files are zero-length. One example is the batch-mode sftp I wrote about last post.

Here's an illustrative example, followed by a trick to solve it.

In the empty-sftp program, empty spawns an sftp session, which runs in parallel, waiting for me (really empty, but it thinks it's me) to type ftp commands. If I only want to run one command

sftp> get filename

I can exit my script after issuing the command, but the pty will still be running. I can kill the pty before I exit the script, but if the file transfer hasn't finished, the kill will truncate the output.

The basic problem is that there is no process that I can watch that terminates after the file transfer ends; like any shell (which is really what it is), sftp just keeps running, waiting for the next command.

The trick:

When the transfer's over, sftp closes the output file. We can watch to see when that happens, like this:
while true
do
    sleep 1
    /sbin/fuser "$localfile" > /dev/null || break
done


As soon as it finishes, we kill the sftp session, and exit.
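Wrapped up as a function (the name is mine): fuser exits non-zero once nobody has the file open, which is the signal that the transfer's done.

```shell
# wait_until_closed: poll once a second until no process has $1 open,
# then return. fuser exits 0 while some process still holds the file.
wait_until_closed() {
    while fuser "$1" > /dev/null 2>&1
    do
        sleep 1
    done
}
```

After the sftp get, wait_until_closed $localfile tells us it's safe to kill the session.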

Empty Turns "Hard" Into "Not-So-Hard"

Some things should be personal, even when you're at a computer.

Logging in, for example.

Good Linux command-line tools read from standard in and write to standard out, which are connected, by default, to the keyboard and screen. I try out programs by typing at them. When I'm convinced I understand how they work, I can redirect I/O to give them input from a file and capture their output in a second one.

The reverse is true, too: I can automate things I typically do interactively, just by having the programs do I/O to files instead of terminals. I've written automated test suites for screen editors with this trick.

Logging-in is different. When the login program prompts you for your name and password, it puts the prompts directly on the screen and reads the answer from the keyboard. Stdin and stdout aren't part of the picture.

This not only makes logging in hard to automate, it's supposed to make it hard to automate: the point is to make it hard to build robots that batter down the system's doors, trying user name after user name and password after password for days on end.

It's a security feature. When a program seems not to be taking input from stdin, I think, "Is this like logging in?"

So, how do you automate those un-automatable jobs?

Don Libes, at NIST, in Gaithersburg, Maryland, gave us the first good solution: Expect. Expect creates a pseudo-terminal whose virtual keyboard and screen are FIFOs that you can write to and read from. Expect programs are mostly lists of the strings you expect to see, like login:, paired with the responses you want Expect to send when it sees those strings.

It's wonderful. I've used Expect for everything from testing printers to creating a make replacement that drives builds on Tandem's archaic Guardian OS from Unix machines running Expect scripts.

Unfortunately, Don picked the wrong horse. Don built Expect on Tcl, a popular programming language in the early nineties, and used Tcl's syntax. Today, Tcl is no longer in fashion; for most folks, writing Expect is like writing Latin.

Nowadays, Perl, Java, and other languages have Expect-like extensions, too, but they all feel fragile and hard to use.

Yesterday, I needed to automate an sftp session, without any way to install public keys on the target machine. I thought about Tcl, winced, then looked at empty, which is supposed to offer the same magic in shell scripts.

Installation was just this:

$ sudo apt-get install empty-expect
The program I wrote was short, simple, and it worked. The meat was these lines, trivial modifications of examples in the man page:
# set up a session
empty -f -i in.fifo -o out.fifo -p -L $log sftp $username@$remote_machine
empty -w -i out.fifo -o in.fifo password: $password'\n'
# transfer the file
empty -w -i out.fifo -o in.fifo sftp "get $remotefile $localfile"'\n'
For things that should be personal but can't be, I recommend empty. At least, on your computer.

Thursday, August 20, 2009

How Do I Find Missing Files in a Sequence?

In this week's Command-Line Kung Fu, Hal Pomeranz & Ed Skoudis field this question from a reader:

I have about 1300 pictures that I am trying to organize.
They are numbered sequentially eg. ( 0001.jpg -> 1300.jpg )
The problem is that I seem to be missing some...

Is there a fast way to be able to scan the directory to see which ones I am missing? Other than to do it manually, which would take a long time.
While they provide good, loop-based solutions, like this one:
$ for i in $(seq -w 1 1300); do [ ! -f $i.jpg ] && echo $i.jpg; done
and this
for ((i=1; $i <= 1300; i++)); do file=$(printf "%04d.jpg" $i);[ ! -f $file ] && echo $file; done
they don't mention bash brace expansion. In bash 4.0 and above, this does the trick:
$ ls {0001..1300}.jpg >/dev/null
No loop. No per-file test. Quite readable: the listing of the files that exist goes to /dev/null, and each missing file shows up as an ls error message on stderr.
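If you want the missing names by themselves, you can harvest ls's complaints (a sketch; the function name is mine, and it assumes the error messages contain the file names, which is true for GNU ls):

```shell
# missing_in_seq: print the names from an expanded list that don't exist,
# by throwing away the listing of real files and keeping only the names
# ls complains about on stderr.
missing_in_seq() {
    ls "$@" 2>&1 > /dev/null | grep -o '[0-9][0-9]*\.jpg'
}
```

Call it as missing_in_seq {0001..1300}.jpg.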

Include Guards in Sourced Files

How do I keep from sourcing the same files over and over?

My shell scripts often have lines like this:
. gripe.sh
Unfortunately, so do other sourced files. How do I keep from sourcing something more than once?

In C programs, my include files handle this with constructs like this:
// foo.h -- include file for foo definitions
#ifndef _FOO_H
#define _FOO_H
The first time foo.h is included, it grabs all the included information. If it's accidentally included a second time, though, _FOO_H is already defined, so the file is skipped.

Here's an analogous construct that I use at the top of shell "include" files.
# include guard
if [ "${_gripe_version:-0}" -gt 0 ]
then
    return 0
fi
_gripe_version=1
... # remainder of file
I could put the file contents inside the else block, and the fi at the end of the file, but I like just clumping the whole thing at the top.
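Here's the guard in action (gripe.sh and its contents are made up for the demo): sourcing the file twice runs its body only once.

```shell
tmpdir=$(mktemp -d)
cat > "$tmpdir/gripe.sh" <<'EOF'
# include guard
if [ "${_gripe_version:-0}" -gt 0 ]
then
    return 0
fi
_gripe_version=1

# remainder of file; this counter stands in for the real contents
gripe_loads=$((${gripe_loads:-0} + 1))
EOF

. "$tmpdir/gripe.sh"
. "$tmpdir/gripe.sh"
echo "$gripe_loads"     # the body ran once, so this prints 1
```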

Monday, August 10, 2009

Another Use For The Colon (:) Command

I've already blogged about the shell's no-op command, when I use it, and why.

I forgot about another way I use it. Here 'tis.

Consider this loop:
for i in {1..100}
do
    some-command-or-other $i
done
If you want to comment out the loop, you could comment out the whole thing by putting a comment character in front of each line, but that's tedious. Easier would be to comment out the one-line loop body, like this
for i in {1..100}
do
    # some-command-or-other $i
done
Except, um, that's an empty loop body and a syntax error.

bash: syntax error near unexpected token `done'
This, however, is fine:
for i in {1..100}
do
    : some-command-or-other $i
done
That loop body isn't empty, it just doesn't do anything. No fuss, no muss.
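One caveat worth knowing: the colon ignores its arguments, but the shell still expands them first (so any command substitutions in them still run), and the loop itself still executes all 100 times:

```shell
count=0
for i in {1..100}
do
    : some-command-or-other $i    # the colon discards its arguments
    count=$((count + 1))
done
echo "$count"    # prints 100
```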

Friday, August 7, 2009


I have seen the future of retail and it is my phone.

Today, Amazon announced an Amazon Market app for the Android. (There's one for the iPhone, too.)

You point the camera at the barcode of an item and push a button. It scans the barcode, then looks to see whether Amazon stocks the item. If so, it puts the item onto your wish list, so you can buy it (or so someone else can buy it for you).

What would you pay? But wait. There's more.

If Amazon doesn't have your item, they offer another option. Select the other button, and your camera takes a picture of the item, sends it up to Amazon, and a minute or less later, Amazon tells you *similar* items that they stock. And puts them on your wish list.

How do they do this? Is it OCR? Is it image recognition? Is it Amazon Mechanical Turk? I have no idea.

And, frankly, who cares?

They make it easy to buy. From them. This is brilliant. Better even than the browser-based "Buy this together with ...." Better even than "You might also be interested in." Better even than "People who bought this item also bought ...."

I have seen the future of retail and it is my phone.