A weekend in place of a quarter

Three years of towing records, one screen at a time, due now.

tl;dr About 28,000 records pulled, enriched, and merged over one weekend. Submitted on time.

used: Python, Excel, Google Drive

A tax audit asked a small towing company for three years of records: every motor-club call, with location and payment, so the sales tax could be worked out exactly. The records existed. One of the owners sat me down and showed me where: a national billing portal that takes five to fifteen seconds to show the list of calls, five to fifteen more to open one, then 30 to 60 seconds of reading and writing down before the page refreshes and the whole thing starts over for the next call. About 28,000 calls. At a minute or more each, that is months of full-time clicking; at the pace of an office that also has a business to run, it may have taken years.

I wrote a Python script to do what the office would otherwise have done: walk the months, open every record, write it to disk. A half-second pause between records, and a progress file so an interruption costs nothing but the time since the last checkpoint. The first hour found the usual surprises, mismatched fields on some records and a rate limit, so I adjusted and let it run. The pull was done before sunrise.

Three years of records, two ways

by hand, through the UI a minute or more per call, every call the script ~4 h, one Friday night about 28,000 calls, same time scale

The pull was only half of it. Raw records answer "what happened"; a tax audit wants "where" and "at what rate," and the calls were split across three motor clubs, only one of which was the hard one. So the pipeline ran in order: walk the hard club's call list and fetch every call by ID into a sheet, aggregate and normalize it with the other two clubs' exports, and only then do the tax work on the one master sheet. Reduce the addresses to a unique set and put a zip to each; reduce the zips to a unique set and put a tax rate to each; propagate it all back into the master and calculate the tax per call. Working with unique sets instead of 28,000 rows meant a few hundred lookups instead of tens of thousands. Then clean up and package.

The pipeline

pull the call list fetch every call by ID merge with the other clubs' exports join in zips and tax rates calculate the tax per call package into Google Drive

Started Friday night, done by Monday: the full date range, enriched, in one CSV in the company's Google Drive, openable by anyone with Excel. Everything was submitted on time. The script was a disposable tool for a one-time job; the records are the durable thing, and they are plain files now, readable by anyone the company hires for the next audit.

↑ top