Excerpted from 55 and Out, AI Automation
I Built a GPT That Reads My Inbox. It Misses 4% and That's Fine.
The classifier I described a few weeks back has been running for six months. Here's what I've learned about being okay with imperfect automation.
In six months it has classified roughly 9,200 emails. I audited a random sample of 500 against my own judgment. It got 480 right. That's a 96% accuracy rate.
Of the 20 it got wrong:
- 11 were FYI it labeled trash: newsletters with one piece of news I would have wanted to see
- 6 were reply needed it labeled FYI: usually short, casual asks from people who didn't include a question mark
- 2 were trash it labeled reply needed: fancy phishing
- 1 was a flight change buried in marketing chrome (the one that prompted my allow-list rule)
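The audit arithmetic can be reproduced in a few lines. This is a minimal sketch: the counts are the ones reported in the audit, but the dictionary keys, the `wilson_interval` helper, and the idea of putting a 95% Wilson interval around the result are my additions for illustration, not part of the classifier itself. The interval is there only to show how much a 500-email sample can wobble.

```python
import math

SAMPLE_SIZE = 500

# Error counts from the audit. The key wording is illustrative;
# the numbers are the ones reported in the post.
errors = {
    "FYI labeled trash": 11,
    "reply needed labeled FYI": 6,
    "trash labeled reply needed": 2,
    "flight change buried in marketing": 1,
}

def wilson_interval(successes, n, z=1.96):
    """Return the Wilson score interval (low, high) for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

total_errors = sum(errors.values())
correct = SAMPLE_SIZE - total_errors
accuracy = correct / SAMPLE_SIZE
low, high = wilson_interval(correct, SAMPLE_SIZE)

print(f"errors: {total_errors}, accuracy: {accuracy:.1%}")
print(f"95% interval on accuracy: {low:.1%} to {high:.1%}")
```

On these numbers the interval runs from roughly 94% to 97%, so the 96% figure is a reasonable estimate rather than an exact one.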
So: 14 misses in the 500-email sample where I might have missed something useful. Scaled to all 9,200 emails, that is roughly 260 over six months, or about ten a week.
Compare that to the alternative: me reading every email. In six months I would have spent roughly 90 hours processing my inbox by hand at my old rate. The classifier costs me about 4 hours a month to skim the FYI bucket and audit, or about 24 hours over the same six months. Net savings: roughly 66 hours.
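For anyone who wants to plug in their own numbers, here is the time comparison as a small sketch. The figures are the ones from the paragraph above; the constant names are mine, and build time is left out because the post does not give it.

```python
MONTHS = 6
MANUAL_HOURS = 90            # estimated hours to read everything by hand
UPKEEP_HOURS_PER_MONTH = 4   # skimming the FYI bucket and auditing

upkeep_hours = UPKEEP_HOURS_PER_MONTH * MONTHS
net_savings = MANUAL_HOURS - upkeep_hours

print(f"upkeep: {upkeep_hours} h, net savings: {net_savings} h")
# prints "upkeep: 24 h, net savings: 66 h"
```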
The 4% I'm willing to lose is worth the 96% I get back.
What I had to make peace with: the idea that "missing one" is a normal cost of leverage. I used to feel a small panic at the idea of letting a script decide what I see. Then I noticed I had been letting Gmail's algorithm do exactly that for fifteen years, just with worse rules. At least I wrote my rules myself.
What I'd do again: build it cheaper next time. The first version used the most expensive model available. The second version uses a small model and is 98% as good for 4% of the cost.
What I wouldn't: try to push it to 100%. Every percentage point past 95 doubles the build time and triples the prompt complexity. The juice isn't worth the squeeze. Build the 95% version, accept the misses, get on with your day.
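Taken literally, that rule of thumb compounds quickly. A small sketch, assuming the doubling and tripling apply to each whole percentage point above 95 (the function name and the baseline of 1 at 95% are mine, and the rule itself is an impression, not a measurement):

```python
def cost_multipliers(target_accuracy, baseline=95):
    """Return (build_time, prompt_complexity) multipliers relative to the baseline."""
    points = max(0, target_accuracy - baseline)
    return 2 ** points, 3 ** points

for target in range(95, 100):
    build, prompt = cost_multipliers(target)
    print(f"{target}%: build time x{build}, prompt complexity x{prompt}")
```

Under that assumption, a 99% version would take 16 times the build time and 81 times the prompt complexity of the 95% version.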