
Bill George is someone I usually agree with. He's been a consistent voice for principled leadership and organizational culture. So when he posted enthusiastic support on LinkedIn for Harvard's recent vote to cap undergraduate A grades at 20% of any class, and pointed to HBS's own 1-2-3 grading system as evidence it "works well," I sat with that for a while before concluding he's wrong on two levels: the problem isn't diagnosed correctly, and the solution doesn't follow even if it were.
Is 60% A's actually a problem?
More than 60% of undergraduate grades awarded at Harvard in 2025 were A's, up from 24% in 2005. The faculty subcommittee called that grade inflation and voted to cap A's at 20% of any class. But is 60% A's actually a problem, or just a number that feels wrong?
Consider what we don't know. Harvard admits around 3% of applicants, selecting for academic ability more aggressively than almost any institution on earth (setting aside legacy admissions and the occasional building named after a donor's family...). Maybe a significant portion of those students genuinely earn A's on a well-designed, rigorous exam. Maybe the exam was too easy. Maybe the professor grades generously. Maybe the course material isn't demanding enough to differentiate outcomes. Maybe some combination of all of the above. The grade distribution alone tells us none of that. It's a symptom without a diagnosis.
The faculty subcommittee framed it as making grades "mean what they say they mean." But that framing assumes 60% A's is inherently wrong, which requires knowing what the grades should reflect, and that's exactly the question they skipped. How does an A in an undergraduate Harvard course compare to an A at a less selective school? Are the courses as demanding as Harvard's reputation implies? These aren't rhetorical questions; they're the ones that need answers before anyone reaches for a policy lever.
The wrong solution regardless
Even granting that something has drifted in Harvard's grading culture, capping A's at 20% is the wrong fix. The issue is the difference between criterion-referenced and norm-referenced evaluation. Criterion-referenced grading asks whether work meets a defined standard. Norm-referenced grading asks how a student ranks against peers. Harvard's new policy is purely norm-referenced: if 30% of students in a class produce work that genuinely merits an A, a third of them (10% of the class) will be graded down anyway, not because their work was deficient but because the math requires a loser.
Gregory Samanez-Larkin, a professor of psychology and neuroscience at Duke, left the sharpest comment in the LinkedIn thread: "1-2-3 works well for what? Honestly curious." Nobody produced a clean answer, which is telling.
The actual fix, harder but correct, is to define what an A requires and hold that line. If course material isn't rigorous enough to naturally produce a spread of outcomes, that's the problem to address. Capping the grade doesn't make the course harder. It just penalizes students for the instructor's design choices.
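For readers who want to see the arithmetic, here is a minimal sketch of the two grading approaches side by side. Everything in it is an assumption made up for illustration: the ten scores, the 90-point A standard, and the function names are hypothetical, and only the 20% cap comes from the policy itself. It is not how Harvard computes grades, just a toy version of the 30%-versus-20% case above.

```python
# Hypothetical illustration: the scores and the 90-point A standard are
# assumptions; only the 20% cap mirrors the policy discussed in the text.

def criterion_referenced(scores, a_threshold=90):
    """Award an A to every score that meets the fixed standard."""
    return [score >= a_threshold for score in scores]


def norm_referenced(scores, cap=0.20):
    """Award an A only to the top `cap` fraction of the class, by rank."""
    max_as = int(len(scores) * cap)
    # Rank student indices from highest score to lowest.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    winners = set(ranked[:max_as])
    return [i in winners for i in range(len(scores))]


# Ten students; three of them (30%) meet the assumed A standard of 90.
scores = [96, 93, 91, 88, 85, 82, 80, 77, 74, 70]

by_standard = criterion_referenced(scores)
by_cap = norm_referenced(scores)

# Students whose work met the standard but who lose the A under the cap.
graded_down = [
    s for s, met, kept in zip(scores, by_standard, by_cap) if met and not kept
]

print(sum(by_standard), sum(by_cap), graded_down)  # prints: 3 2 [91]
```

The student with a 91 did nothing differently between the two runs. The only thing that changed was the rule, which is the point: the cap moves the grade without touching the work or the course.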
Jack Welch ran this experiment first
Jack Welch made the same error at massive scale with his "vitality curve," the 20-70-10 system at GE where the top 20% were rewarded, the middle 70% coached, and the bottom 10% fired annually. Mark Graban of Lean Blog asked the obvious question: if GE had to remove the bottom 10% every year, why did GE keep hiring turkeys?
The vitality curve assumes any workforce naturally distributes along a bell curve, so you might as well act on it. But a well-hired, well-developed team isn't guaranteed to produce a bottom 10% of underperformers. A great leader who recruits carefully, trains well, and sets clear expectations might build a team where the weakest performer would be a star somewhere else. Forcing the ranking anyway fires people not because they failed a standard but because the math requires someone at the bottom.
The variables that actually explain underperformance are the same ones the curve ignores: unclear job requirements, inadequate training, a talented person in the wrong role, compensation too low to attract strong candidates, or simply a manager who tolerates mediocrity and then blames the team for it. Removing the bottom 10% addresses none of those. It just produces a vacancy and restarts the cycle.
The question neither system asks
Microsoft ran Welch's stack ranking for years, then abandoned it entirely, citing the damage it did to collaboration and internal innovation. The research on forced ranking is consistent: the practice undermines teamwork, and individual rankings show no reliable correlation with actual organizational performance.
Both Harvard's grade cap and Welch's vitality curve skip the hard diagnostic work in favor of a mechanical fix that feels rigorous. Mandating a distribution is easy. Defining what mastery requires, building courses or jobs demanding enough to test it, and developing the people who fall short — that's the actual work.
When you force or rig the curve, you don't raise performance. You just guarantee somebody loses.