Building a tenure-dossier pipeline

Tenure dossiers used to eat months of manual PDF assembly. A two-language pipeline does it in five minutes.

Months to 5 minutes per packet.

Node.js · Express · React · Python · BullMQ · Redis · pypdf · Interfolio API · Shibboleth · systemd

Tenure cases at Cornell live or die by a single PDF. Dozens of letters, evaluations, syllabi, and statements, each pulled from a separate place, each routed to its own numbered slot in a hierarchy the college committee expects to see exactly so. An administrative assistant in Nutritional Sciences once told me a packet "typically ate months" of careful work: download, classify, watermark, merge, repeat, then re-do the parts that came back wrong.

The brief was a web app where a staff member picks a candidate and gets a finished, bookmarked PDF on the other side. The shape of the problem turned out to be less about PDFs and more about two stubborn integrations on the way in.

The first was the Interfolio side. Their public API uses HMAC-SHA1 request signing[1], and there is no Node library that speaks the dialect they want. A neighboring college had a Python desktop app that already handled the auth, so I read theirs and ported the signing flow into Node:

const crypto = require('crypto')

// Build the INTF-signed header value for one request.
function generateHMACHeader(privateKey, publicKey, timestamp, requestString, requestVerb) {
  // Canonical string: verb, two empty lines, timestamp, then the request path.
  const verbRequestString = `${requestVerb}\n\n\n${timestamp}\n${requestString}`
  const signedHash = crypto.createHmac('sha1', privateKey).update(verbRequestString, 'utf8')
  return `INTF ${publicKey}:${signedHash.digest('base64')}`
}

The two blank lines between the verb and the timestamp are not optional. Get them wrong and every request comes back with a polite 401 that does not tell you why.
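To make the blank-line sensitivity concrete, here is a minimal sketch that builds the canonical string described in the footnote and shows that dropping one empty line produces a different signature. The keys, timestamp, and path are hypothetical placeholders, and `canonicalString` and `signRequest` are illustrative names, not part of Interfolio's API.

```javascript
const crypto = require('node:crypto')

// Canonical string per the footnote: VERB, two empty lines, TIMESTAMP, PATH.
function canonicalString(verb, timestamp, path) {
  return [verb, '', '', timestamp, path].join('\n')
}

// HMAC-SHA1 over the canonical string, base64-encoded, in the INTF envelope.
function signRequest(privateKey, publicKey, verb, timestamp, path) {
  const digest = crypto
    .createHmac('sha1', privateKey)
    .update(canonicalString(verb, timestamp, path), 'utf8')
    .digest('base64')
  return `INTF ${publicKey}:${digest}`
}

// Hypothetical values, for illustration only.
const ts = '2024-01-01 00:00:00'
const path = '/example/path'
const good = signRequest('private-key', 'public-key', 'GET', ts, path)

// Same inputs with one empty line dropped: the signature no longer matches.
const bad = 'INTF public-key:' + crypto
  .createHmac('sha1', 'private-key')
  .update(`GET\n\n${ts}\n${path}`, 'utf8')
  .digest('base64')

console.log(good === bad) // false
```

Because the server recomputes the digest from its own canonical string, a single missing newline is indistinguishable from a wrong key, which is consistent with the uninformative 401.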

The second integration was on the way out. Node's PDF tooling has a long-running gap: the libraries that merge cleanly do not preserve a hierarchical outline, and the libraries that write outlines do not merge cleanly. Python's pypdf exposes both, so Python stayed and became the worker. A staff member clicks build; the Node API validates the candidate's packet and pushes a job onto a Redis queue; a Python worker pops it, downloads each file from Interfolio with the same HMAC headers, and writes one merged-and-bookmarked PDF.
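The Node half of that handoff can be sketched roughly as follows. This is an assumption-laden illustration, not the production code: the packet shape (`candidateId`, `files`, `slot`), the job name `build-dossier`, and the retry settings are all hypothetical. The `queue` argument is anything that exposes the `add(name, data, opts)` method of a BullMQ `Queue`, which keeps the sketch runnable without Redis.

```javascript
// Hypothetical packet shape: { candidateId, files: [{ id, slot, url }] }.
// Returns a list of human-readable problems; empty means the packet is valid.
function validatePacket(packet) {
  const problems = []
  if (!packet.candidateId) problems.push('missing candidateId')
  if (!Array.isArray(packet.files) || packet.files.length === 0) {
    problems.push('packet has no files')
    return problems
  }
  for (const file of packet.files) {
    if (!file.slot) problems.push(`file ${file.id} has no outline slot`)
  }
  return problems
}

// Validate first, then hand the job to the queue for the Python worker.
// With BullMQ, `queue` would be created as: new Queue('some-name', { connection })
function enqueueBuild(queue, packet) {
  const problems = validatePacket(packet)
  if (problems.length > 0) {
    throw new Error(`packet rejected: ${problems.join('; ')}`)
  }
  // Sort by slot (numeric-aware) so the worker can merge in outline order.
  const files = [...packet.files].sort((a, b) =>
    a.slot.localeCompare(b.slot, undefined, { numeric: true }))
  return queue.add('build-dossier', { candidateId: packet.candidateId, files }, {
    attempts: 3, // assumed retry policy
    backoff: { type: 'exponential', delay: 5000 },
  })
}
```

Rejecting a malformed packet before it reaches Redis means the worker only ever sees jobs it can finish, so a failed build points at a download or merge problem rather than at bad input.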

By hand

download · sort · watermark · merge · re-run

~months per packet

One click

pick candidate · build · download

~5 min per packet

The first time the assistant in Nutritional Sciences ran a real packet end-to-end, the build came back in roughly the time it takes to refill a coffee. The wall clock from clicking build to a finished PDF lands around five minutes, dominated by sequential downloads from Interfolio for packets in the thirty-to-sixty-file range. The job has not paged anyone since.

  1. Interfolio's developer documentation specifies HMAC-SHA1 over a canonical request string of VERB\n\n\nTIMESTAMP\nPATH, base64-encoded and wrapped in the "INTF publicKey:digest" envelope. The two empty lines after the verb are part of the spec, left over from an unused header slot in an earlier version of the protocol.