Sunday, March 2, 2008

Old Dogs, New Tricks: Spreadsheets and Running Medians

I look for excuses to use new tools.

A few months ago, I needed to report SVN use at work. Ah, an excuse.

I wrote code to extract the number of revisions per week, massaged it into a comma-separated-variable (CSV) file, pulled that into an OpenOffice spreadsheet (OOCalc), graphed it, printed the graph as a PDF, turned the PDF into PNG, and posted it on a blog that my boss could go to, whenever he wanted to look.

A lot of work.

Except that now I know how to convert PDF to PNG, how to import a graph into a blog, and a lot more about how to use spreadsheets.

The data extraction is a script that runs from a weekly cron job, and I now know how to use OOCalc templates and keyboard shortcuts to turn the data into a posted graph with just a few keystrokes.

Even Debi, our HR maven, can understand the graphs, and she's impressed.

The hard part, now, is just having a year's worth of data to analyze.

I've been recording my daily weight and pedometer readings, in hopes of using them as an excuse to learn something else about spreadsheets.

Today, it paid off.

Looking through my bookshelf, just now, I came on Mosteller & Tukey's Data Analysis and Regression. This book is quirky, but the authors may be America's most famous statisticians (if there is such a thing). John Tukey invented the words "software" and "bit." Their advice is good.

They say to try smoothing data with successive running medians, until the data stabilize.

"Can I do that with OOCalc?"

You bet. Google says it has a MEDIAN() function built right in.

Here are the steps I tried, in less time than it took to type this post.
  • Pick a column to hold the running medians. Click on a cell in that column, and type =median .
  • Go to the column of raw data, and highlight the three cells you want a median of with the mouse. They appear in the formula, as if by magic, with the correct syntax. (Don't forget to close the parens.)
  • Press Enter. Ta-da, the median appears in the selected cell.
  • Click the little button in the lower-right-hand corner of that cell and drag it down the page. All the cells in the column are instantly filled in with the right formulae and values. Now, I have an entire column of running medians.
  • Move over one column, and repeat, to get the running medians of the running medians.
  • Continue this until the data stabilize -- in my case, just one more time.
  • Graph it.

Tukey and Mosteller was written in 1977. I was still using a Monroe.

I'd just taken an elementary statistics class. Everyone ooh'd and ah'd over the HP-35 pocket calculator one of the students had bought. We'd never seen one before.

To calculate running medians, I wrote a program, in FORTRAN, on cards.

From seeing the book on my shelf, through discovering OOCalc's MEDIAN function, to getting a slick graph of the stabilized running medians took me perhaps ten minutes.

Okay, I had to collect two months' of data to have something to analyze, but it was worth it.

No comments: