Why Distilled Models Are Brittle to Fine-Tuning (Explained Through Basketball)
- Satyavrat Bondre
- Feb 12
- 4 min read
Or: What I learned about model compression by comparing myself to Michael Jordan

The Problem Everyone Runs Into
Last week, someone in my LinkedIn comments mentioned they were struggling to fine-tune a distilled OCR model. The model was excellent at table extraction with genuinely impressive performance. But when they tried to adapt it for their specific use case, it immediately overfitted and performed worse than baseline.
This isn't unusual. I've seen this pattern repeatedly: teams pick a distilled model because it's efficient and performs well on benchmarks, then hit a wall when they try to customize it.
The confusion is understandable. If a distilled model is "just a smaller version" of a larger model, shouldn't it fine-tune the same way? Just... with less capacity?
Well, no. And understanding why requires understanding what distillation actually does.
The Basketball Analogy (Yes, Really)
I'm going to explain this through basketball. Bear with me.
Two people learn basketball:
Option A: Michael Jordan
MJ spends years mastering basketball, i.e., becoming MJ.
Offensive strategy, defensive positioning, reading the game, when to pass, when to drive, how to adjust to different opponents. He learns the fundamentals of basketball: why certain plays work, how different skills combine, what makes someone effective on the court.
MJ understands basketball holistically. He's learned principles, not just moves.
Option B: I Copy Michael Jordan's jump shot
Instead of learning basketball from games, I trained only on recordings of MJ’s decisions. Specifically, jump shots.
I watch footage of MJ shooting. I study the mechanics obsessively: the footwork, the release point, the arc, the follow-through. I practice this one motion thousands of times until I can replicate it perfectly. Yes, this is an exaggeration; in reality I'd copy a few key plays, not just one shot. But the point stands: I'm copying outcomes without the full underlying game model.
My jump shot looks exactly like Jordan's. Same form, same consistency, same beautiful arc.
But that's all I know. I never learned why the jump shot works in the context of basketball strategy. I don't know how to create space for it, when to take it versus passing, how to adjust it against different defenders. I learned to mimic the output without understanding the system it exists within.
This is the difference between a base model and a distilled model.
Base model (Option A/MJ): Learns representations, patterns, and generalizable knowledge
Distilled model (Option B/Me): Learns to reproduce the teacher's behavior on the distilled dataset, often with less capacity and less slack for new tasks
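To make "learns to reproduce the teacher's behavior" concrete, here's a minimal NumPy sketch of the core distillation training signal (the function names are illustrative, not from any particular library): the student is scored only on how closely its output distribution matches the teacher's softened outputs, not on whether it models the task itself.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: higher T softens the distribution."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the teacher's softened output distribution
    and the student's. The student is rewarded for mimicking the
    teacher's behavior, not for understanding the underlying data."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = (p_teacher * (np.log(p_teacher) - np.log(p_student))).sum(axis=-1)
    return kl.mean()

# The student is trained to copy the teacher on the distillation set:
teacher = np.array([[2.0, 0.5, -1.0]])
perfect_student = teacher.copy()           # mimics the teacher exactly
off_student = np.array([[0.0, 2.0, 1.0]])  # disagrees with the teacher
```

Notice what's missing from that loss: any direct signal from the real task. A student that matches the teacher's outputs on the distillation set gets zero loss, regardless of whether it picked up anything transferable along the way.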

(Cut from training labs to real-world tournaments) Now, Both of Us Need to Adapt
Here's where it gets interesting.
A local youth basketball program asks both of us to coach their team. Neither of us has coached elementary school kids before. This is a new task for both of us and not something either of us was explicitly trained for.
Michael Jordan's advantage:
MJ can adapt. He understands basketball at a fundamental level. He knows why certain strategies work, how to simplify concepts for beginners, how to adjust plays for different skill levels. He can transfer his knowledge to this new context because he learned principles, not just outputs.
He's expensive to hire (let's say $1 billion), and it takes him time to understand the specific constraints of youth basketball (shorter attention spans, different rules, varied skill levels). But once he does, he's incredibly effective because he has the underlying knowledge to adapt.
My limitation:
I can teach the kids to mimic Jordan's jump shot.
I'm great at that because it's exactly what I learned. But that's it.
When a kid asks "why do we shoot this way?" I don't have an answer beyond "because this is how Jordan did it."
When the kids need to learn defense, passing, or game strategy, I can't help. I never learned those things. I only learned one specific output.
Worse: if I try to teach them full basketball, the team might actually get worse.

They'll become hyper-specialized in jump shots while neglecting every other fundamental skill. The local PE teacher, who knows basketball broadly but not at an elite level, would actually produce better-rounded players.
This is what happens when people try to fine-tune a distilled model.
Mechanically, smaller models have less slack capacity, end up with sharper decision boundaries, and lean harder on teacher-specific shortcuts. When you fine-tune on small data, those shortcuts get destabilized and performance craters.
So Which Do You Pick?
Fixed task, known distribution, need speed? Distilled model. Great choice. (I would have been the greatest shooting coach that team ever had if they had given me that role instead of the head coach role)
Evolving task, custom domain, need flexibility? Base model. Don't fight it.
Red flags you picked wrong:
"We'll customize it later"
"It's good at X, we just need to teach it Y"
"We can always upgrade if needed"
If you're thinking these thoughts, start with a base model.

Back to the Basketball Court
Remember: I can shoot exactly like Michael Jordan (for 1/100,000th the price).
If all you need is someone to hit jump shots in a very specific context, I'm a bargain.
But if you hire me expecting I'll learn defense, passing, or game strategy later, you're going to be so disappointed. That's not what I was built for.
The same is true for distilled models.

They're incredible at what they do. But they're specialists, not generalists. And specialists rarely become generalists, no matter how much you fine-tune them.
One caveat worth noting here is that distillation isn't purely "copy outputs, lose everything." Done well, it can transfer internal structure. You can also mitigate the rigidity with adapters or re-distillation for your specific task.
But you don't magically recover the missing capacity. Mitigation isn't restoration.
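The adapter idea can be sketched in a few lines of NumPy (names like `W`, `A`, `B` are illustrative, and this is a toy linear layer, not a full LoRA implementation): the distilled weights stay frozen, and only a small low-rank correction is trained on top of them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen distilled weight matrix: stays exactly as distillation left it.
W = rng.normal(size=(8, 8))

# Low-rank adapter: only A and B would be trained; W is never touched.
rank = 2
A = np.zeros((8, rank))               # zero-init so the adapter starts as a no-op
B = rng.normal(size=(rank, 8)) * 0.01

def forward(x):
    # Adapted layer: original behavior plus a small learnable correction.
    return x @ (W + A @ B)

x = rng.normal(size=(1, 8))
```

Because `A` starts at zero, the adapted model initially behaves exactly like the frozen distilled one, and fine-tuning can only nudge it through a rank-2 correction. That's precisely the point: you contain the damage instead of destabilizing the distilled weights, but the adapter can't conjure up capacity the student never had.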
Choose accordingly.