The Final Whistle: What 18 Weeks of AI Football Picks Actually Taught Me

Back in September, I started what seemed like a straightforward experiment: pit three leading AI platforms against each other (and me) in my annual NFL football pool. Give them all the same weekly prompt with access to injury reports, team records, point differentials, and betting lines. Explicitly instruct them to take some risks rather than just following the Vegas odds. Then sit back and see who comes out on top.

Simple science, right?

Well, 271 games and 18 weeks later, the results are in. And they tell a more nuanced story than "human beats machine" or "AI dominates sports prediction." The truth, as usual, is more interesting than either headline.

The Final Standings

After 18 weeks of regular season picks:

  • My picks: 177 correct (65.3%)
  • Claude: 173 correct (63.8%)
  • Perplexity: 171 correct (63.1%)
  • ChatGPT: 159 correct (58.7%)

I won. The human beat the machines. Cue the triumphant music and hot takes about AI limitations, right?

Not so fast.

That 6.6 percentage point gap between me and ChatGPT represents about 18 games over the season. The gap between me and Claude was just 4 games. In a pool where weekly winners are decided by a single game or a tiebreaker, these margins tell you something important: we were all playing in roughly the same ballpark.
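If you want to check my arithmetic, it's nothing fancier than dividing pick counts by the season's game total. A minimal sketch in Python, using the counts from my spreadsheet:

```python
# Back-of-the-envelope check on the final standings (counts from my spreadsheet).
GAMES = 271

correct = {"Me": 177, "Claude": 173, "Perplexity": 171, "ChatGPT": 159}

for name, right in correct.items():
    print(f"{name:10s} {right}/{GAMES} = {right / GAMES:.1%}")

# Gaps in raw games, which is what actually matters week to week.
print("Me vs ChatGPT:", correct["Me"] - correct["ChatGPT"], "games")  # 18
print("Me vs Claude: ", correct["Me"] - correct["Claude"], "games")   # 4
```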

What the Numbers Actually Reveal

A friend who's better at statistics than I am looked at my data and pointed out something I'd missed. The real story isn't just about who had the highest accuracy. It's about the relationship between accuracy and consistency.

When you look at the week-to-week variance in our performance, a clear pattern emerges: Perplexity was the most consistent, followed by Claude, then me, then ChatGPT.

In other words, Perplexity was the most stable performer. It rarely had terrible weeks, but it also rarely had spectacular ones. It's the index fund of football prediction. Claude was pretty consistent too. I was all over the map, with some dominant weeks and some absolute stinkers.
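When I say "consistent," I just mean how tightly weekly accuracy clusters around its average. Here's roughly what that comparison boils down to, sketched in Python; the numbers in the example call are placeholders rather than my actual season, since I'm only sharing a handful of weekly scores in this post:

```python
import statistics

# "Consistency" here just means the spread of weekly accuracy.
# Pass in correct-pick counts and games played for each week.
def weekly_spread(weekly_correct, weekly_games):
    pct = [c / g for c, g in zip(weekly_correct, weekly_games)]
    return statistics.mean(pct), statistics.stdev(pct)

# Example call with placeholder weekly numbers (not my real data):
mean_acc, spread = weekly_spread([10, 9, 12, 5], [15, 14, 16, 14])
print(f"mean {mean_acc:.1%}, week-to-week std dev {spread:.1%}")
```

A lower standard deviation is Perplexity's profile; a higher one with the same mean is mine.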

Here's the thing: in a football pool with 30 participants, being consistent doesn't help you win. You need to beat the field, which means you need to take calculated risks and be right when others are wrong. The weeks when you nail a few contrarian picks that the crowd misses? Those are the weeks you win the pool.

I won four weeks this season. That's far better than random chance would suggest in a 30-person pool, and it's precisely because of those high-variance weeks where my gut diverged from consensus and turned out to be right. If I had Perplexity's consistency, I'd probably have won fewer weeks even if my overall accuracy were the same.
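To put "far better than random chance" in perspective: if every entrant had an equal 1-in-30 shot at each week (a deliberately naive model that ignores skill and ties), you'd expect about 0.6 wins over 18 weeks, and four or more would be a well-under-one-percent outcome. A quick check:

```python
from math import comb

weeks, entrants = 18, 30
p = 1 / entrants                 # chance of winning any given week under pure luck
expected = weeks * p             # about 0.6 wins over the season

# Probability of 4 or more weekly wins under that pure-luck model
p_four_plus = sum(comb(weeks, k) * p**k * (1 - p)**(weeks - k)
                  for k in range(4, weeks + 1))
print(f"expected wins ≈ {expected:.1f}, P(≥4 wins by luck) ≈ {p_four_plus:.3%}")
```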

This gets at something fundamental about the difference between forecasting and competing. Perplexity optimized for accuracy. I optimized for winning a pool. Those are related but distinct objectives, and the data bears this out: the most consistent system didn't win a single week.
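To convince myself this isn't just a story I'm telling, here's a toy simulation of a winner-take-all pool. Every accuracy rate in it is a round number I made up, not anyone's actual picks; the point is only that, with the average held fixed, the swingier entrant takes more outright weekly wins than the steady one:

```python
import random

# Toy 30-person winner-take-all pool, 16 games a week.
# "steady" picks at 63% every week; "streaky" averages 63% but swings
# between cold (51%) and hot (75%) weeks; the other 28 entrants sit at 62%.
# All rates are invented for illustration.
def score(p, games=16):
    return sum(random.random() < p for _ in range(games))

def simulate(weeks=10_000):
    wins = {"steady": 0, "streaky": 0}
    for _ in range(weeks):
        field_best = max(score(0.62) for _ in range(28))
        steady = score(0.63)
        streaky = score(random.choice([0.51, 0.75]))
        if steady > max(field_best, streaky):
            wins["steady"] += 1
        if streaky > max(field_best, steady):
            wins["streaky"] += 1
    return wins

print(simulate())  # the streaky picker ends up with noticeably more outright wins
```

It's a crude model (ties go to nobody, the field never adapts), but it captures why I'd rather be streaky than steady in a pool this size.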

The Tale of Two Chaos Weeks

Weeks 5 and 6 perfectly illustrate this dynamic. In Week 5, I managed just 5 correct picks out of 14 games (36%). The AIs? ChatGPT got 3, Perplexity 6, Claude 5. Nobody could figure out what was happening. Upsets everywhere. Even the favorites that did win barely squeaked by.

Week 6 wasn't much better. I scraped together 7 correct out of 15. ChatGPT completely cratered with just 4. Only Claude rose above 50% with a solid 10 correct, showing that its deeper analytical approach could handle some chaos better than the others.

These chaotic weeks should theoretically favor AI. If there's some hidden pattern in the data that signals when underdogs are about to shock everyone, machine learning ought to find it. Instead, the chaos seemed to amplify our different approaches rather than reveal any AI advantage.

Interestingly, when the season stabilized in the second half, the patterns shifted. Perplexity had its best week in Week 10, nailing 12 out of 14. Claude dominated Week 17 with 12 out of 16. I had strong showings in Weeks 3, 7, and 15. The variance in who performed well each week was almost as interesting as the overall standings.

Claude's Quiet Excellence

While I came out on top overall, Claude deserves credit. With 63.8% accuracy, it was the strongest single AI platform, and it showed remarkable analytical depth throughout the season. In Week 1, it nailed 14 out of 15 picks. In Week 17, when most of us were struggling, it hit 12 out of 16.

What distinguished Claude from the other AIs was its willingness to synthesize multiple factors beyond just injury reports and betting lines. It seemed to actually understand football context: momentum shifts, coaching tendencies, matchup advantages. It felt less like a pattern-matching algorithm and more like someone who actually follows the sport.

ChatGPT, despite being conversational and easy to work with, never quite found its footing. At 58.7%, it spent most of the season trailing the pack. Perplexity, meanwhile, played it safe and ended up exactly where you'd expect a consensus-following system to land: solidly middle of the pack.

The Week 18 Wild Card

The final week of the season introduced a variable none of us could fully account for: strategic rest decisions. Teams like the Eagles and Chargers, having already locked in their playoff positions, chose to sit key starters in Week 18. These decisions often came late, sometimes just hours before kickoff.

Everyone went 9 for 16 that week. We all got beat by the same uncertainty. When teams aren't trying to win, prediction models break down. You can analyze all the injury reports and historical matchups you want, but if half the starters aren't playing and nobody knew until Thursday, you're just guessing.

This is actually a perfect microcosm of what the entire season revealed about AI prediction: it excels when the inputs are stable and patterns are consistent. It struggles when context shifts in ways that aren't captured in the data.

What I Actually Learned

I went into this experiment half-expecting the AIs to dominate. This seems like exactly the kind of task they should excel at. Pattern recognition over large datasets, probabilistic reasoning, synthesizing multiple variables. These are supposed to be AI's superpowers.

But sports prediction remains stubbornly, beautifully human. The better team doesn't always win. Coaches make bizarre decisions. Players have bad days. Backups come out of nowhere. And apparently, a retired guy with an internet connection and too much time on Sundays can still outperform sophisticated AI systems.

My edge wasn't superior football knowledge. I'm not watching film or tracking advanced metrics. But I do absorb context: which teams are trending up, which coaches are on the hot seat, which offensive lines are clicking. That qualitative sense, combined with a willingness to make contrarian picks when my gut says the conventional wisdom is wrong, turned out to be worth something.

The AIs, despite having access to far more data than I could ever process, couldn't quite capture that intuitive synthesis. They approached each week as a fresh optimization problem. I approached it as a continuation of evolving narratives.

The Real Insight

If I had to summarize what 18 weeks of this experiment taught me, it's this: AI is phenomenally good at finding patterns in data, but it's not magic. When the patterns are stable and the inputs are comprehensive, it performs well. When you need to weigh conflicting signals, incorporate soft context, and make judgment calls under uncertainty, human intuition still has real value.

This doesn't mean AI is useless for sports prediction. Claude's 63.8% accuracy is nothing to sneeze at. If I were building a betting model or running survivor pools, I'd absolutely incorporate AI analysis. But in a competitive pool where you need to beat the field, not just be correct, there's still room for human judgment.

And maybe that's the broader lesson about AI in 2025. It's not about humans versus machines. It's about understanding what each does well and using them appropriately. The AIs gave me better baseline analysis than I could have generated on my own. But the final decision to deviate from that baseline, to take calculated risks in specific spots, to trust my gut when something felt off? That's where I added value.

Looking Ahead

The experiment is over for now. The playoffs start this weekend, and I'm planning to just enjoy them without the weekly grind of making picks. But I'll be back next season with some refinements.

Final score: Humans 1, Machines 0.5 (Claude gets partial credit). But it was closer than I expected, and I have a feeling the machines are learning faster than I am.
