How to Vet ML Engineers: A Technical Hiring Framework for 2026


Most ML engineer interviews test the wrong things—LeetCode and theory instead of practical judgment. The result: companies hire people who can pass interviews but can't ship models. Here's the framework that actually predicts on-the-job performance.

VAMI Editorial
·January 15, 2026

TL;DR

  • LeetCode fails for ML: Algorithmic speed ≠ modeling judgment or production experience.
  • Use 4 stages: Portfolio review → System design → Code critique → Domain depth interview.
  • Ask about failures: How they debug production issues separates senior from junior engineers.
  • Data quality first: Engineers who ask about data before picking a model are gold.
  • VAMI's track record: We use this framework internally to pre-vet every candidate before client presentation.

Why Traditional ML Interviews Fail: The LeetCode Problem

Let's start with the hard truth: most ML hiring processes are broken because they measure the wrong skills.

The standard approach looks like this: a coding interview (LeetCode), a behavioral interview, maybe a take-home project. This works fine for backend engineers, where algorithmic thinking directly translates to job performance. But ML is different.

An ML engineer who crushes LeetCode might spend months choosing between XGBoost and a neural network without validating the choice against production metrics. They might build a beautiful model that leaks training data. They might not know how to debug when accuracy drops in production.

The predominant approach across tech companies still treats algorithmic coding ability as the primary filter for ML roles. Yet in practice, coding speed is a poor predictor of actual ML engineering performance in production.

The Core Problem: LeetCode tests raw algorithmic speed. ML engineering is about judgment—knowing when a simple model is better than a complex one, understanding data constraints, debugging production failures. These skills don't appear in interviews—they appear in shipped work.

The 4 Dimensions of a Strong ML Engineer

Before we design a vetting process, we need to define what we're actually looking for. Not all ML engineers are the same, and different roles weight these dimensions differently.

1. Modeling Skill

Can they choose the right model for a problem? Do they understand trade-offs (accuracy vs. latency, interpretability vs. performance)? Have they worked with different modalities (tabular, NLP, vision)?

How to evaluate: System design interview + domain depth questions. Not: LeetCode.

2. Engineering Discipline

Can they write clean, maintainable code? Do they test their models? Can they refactor messy code? Do they think about monitoring and observability?

How to evaluate: Code review exercise + portfolio inspection. Not: coding speed.

3. Product Sense

Do they ask about business metrics before jumping to models? Do they understand the difference between accuracy and value? Can they communicate with non-technical stakeholders?

How to evaluate: System design (what questions do they ask?) + reference checks.

4. Communication

Can they explain decisions to other engineers? Can they present findings to stakeholders? Do they document their work?

How to evaluate: Throughout the interview process. Do they explain their thinking?

Most traditional interviews only measure dimension #2 (engineering discipline) through coding, and they miss dimensions #1, #3, and #4 entirely. That's why the process fails.

The 4-Stage Vetting Framework That Works

This is the framework we use at VAMI. It takes 4-6 weeks from initial review to offer. Here's how each stage maps to the four dimensions:

Stage 1

Portfolio Review

What to look for on GitHub and in past projects

Green Flags ✓

  • Shipped models to production
  • Post-mortems of failures
  • Data exploration notebooks

Red Flags ✗

  • Only toy datasets
  • No production context
  • Copy-pasted Kaggle solutions
Stage 2

System Design for ML

Real problem, evaluate thinking process

Green Flags ✓

  • Asks about data first
  • Discusses trade-offs
  • Mentions monitoring and iteration

Red Flags ✗

  • Jumps to latest model
  • No data quality questions
  • Over-engineers immediately
Stage 3

Code Review Exercise

Give messy ML code, ask them to critique

Green Flags ✓

  • Identifies data leakage
  • Spots training/inference mismatch
  • Suggests refactoring for maintainability

Red Flags ✗

  • Only comments on style
  • Misses statistical issues
  • Can't explain why code is problematic
Stage 4

Domain Depth Interview

5 questions that separate senior from mid-level

Green Flags ✓

  • Discusses trade-offs explicitly
  • Brings up production constraints
  • Shows failure experience

Red Flags ✗

  • Textbook answers
  • No real examples
  • Only theory, no practice

Stage 1: Portfolio Review (What to Look For)

Start here. A GitHub profile tells you more than any interview. Look for:

Shipped vs. Toy Projects

Have they shipped a model to production? How do you know? Look for: deployed endpoints, real user data (even anonymized), monitoring/logging code, documentation of production decisions.

Evidence of Iteration

Good projects show multiple commits, experiments, failed approaches. Red flag: a single commit with a polished model (likely copied from a tutorial).

Data Exploration

Do they have notebooks showing EDA? Do they discuss data quality? Or do they jump straight to modeling?

Post-mortems of Failures

The best signal: a README or issue that says "This approach didn't work because... Here's what I learned." This shows judgment and humility.

If a candidate has weak GitHub presence, ask them to walk you through their last shipped project (even if it's internal). Can they explain the data pipeline? Where did it fail? How did they debug it? Their answers matter more than commit history.

Stage 2: System Design for ML (Evaluate Thinking, Not the Answer)

This is where you learn if they think like a production engineer or a researcher. Give them a real problem and watch how they approach it.

Example Problem

"Build a fraud detection system for a payment processor. The business wants to block 95% of fraud while keeping false positives under 1%. Transactions happen in real-time. You have 2 years of historical data."

What to listen for:

  • Do they ask about data first? "What features are available? How imbalanced is the dataset? What's the cost of a false positive vs. false negative?"
  • Do they discuss trade-offs? "If we increase recall, false positives go up. How sensitive is the business to that?"
  • Do they think about iteration? "I'd start with a simple model, measure performance, then iterate if needed."
  • Do they jump to latest models? "Let's use a transformer" (without understanding data, scale, or constraints).
  • Do they forget monitoring? No mention of how to detect model drift, data quality issues, or performance degradation in production.

Give them 30-45 minutes. Interrupt if they go too deep into implementation details. You're not looking for the perfect design—you're looking for the thinking process.
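To make the recall vs. false-positive trade-off concrete for an interviewer, here is a minimal sketch on synthetic, heavily imbalanced data (the numbers and features are illustrative, not from a real fraud system): lowering the decision threshold raises recall but drags precision down.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
n = 20_000
y = (rng.random(n) < 0.02).astype(int)          # ~2% fraud: heavily imbalanced
X = rng.normal(size=(n, 3)) + y[:, None] * 1.5  # fraud shifts the features

clf = LogisticRegression().fit(X, y)
scores = clf.predict_proba(X)[:, 1]

# Sweeping the threshold: each step down trades precision for recall.
for threshold in (0.9, 0.5, 0.1):
    pred = (scores >= threshold).astype(int)
    print(threshold,
          f"recall={recall_score(y, pred):.2f}",
          f"precision={precision_score(y, pred, zero_division=0):.2f}")
```

A candidate who reaches for a curve like this, and then asks how costly each false positive is to the business, is showing exactly the thinking this stage is designed to surface.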

Stage 3: Code Review Exercise (Give Messy Code, Ask Them to Critique)

This is the clearest test of engineering discipline. Take a real ML script (with intentional issues) and ask the candidate to review it. Spend 45 minutes.

What kind of issues should the code have?

  • Data leakage (preprocessing before train/test split)
  • No cross-validation
  • Hardcoded paths and hyperparameters
  • Missing error handling
  • Inconsistent random seeds
  • No validation of input data
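As an illustration of the most important item on that list, data leakage from preprocessing before the split, here is a toy sketch showing the bug and the fix (synthetic data; variable names are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# BUG: fitting the scaler on ALL rows leaks test-set statistics (mean/std)
# into the training data. Validation metrics will be optimistically biased.
X_leaky = StandardScaler().fit_transform(X)

# Correct order: split first, then fit the scaler on training data only.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```

A candidate who spots this ordering problem unprompted is demonstrating the statistical instincts the exercise is testing for.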

What you're looking for:

  • ✓ Do they identify data leakage? (This is the #1 mistake in ML code, and senior engineers spot it instantly.)
  • ✓ Do they suggest testing strategies? (Not just "write unit tests," but "cross-validate" and "hold out a test set".)
  • ✓ Do they think about reproducibility? (Random seeds, versioning, logging.)
  • ✗ Do they only comment on style? (Indentation, variable names—these matter, but not as much.)
  • ✗ Do they miss the statistical errors? (Data leakage, overfitting.) This is a dealbreaker.

This exercise reveals more about real-world performance than any LeetCode problem. A senior ML engineer will spot data leakage in seconds.

Stage 4: Domain Depth—5 Questions That Separate Senior From Mid-Level

After 3 stages, you have a good sense of the candidate. The final round is a 60-minute conversation with a senior engineer. These 5 questions separate senior-level thinking from mid-level:

Q1

"You're building a recommendation system for e-commerce. Initial model has 92% accuracy on a holdout set. How do you know if that's good?"

What this evaluates: Can they think beyond accuracy? Do they ask about business metrics (CTR, conversion), class imbalance, and data leakage?
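One quick way to see why a raw accuracy number means little on its own: on imbalanced data, a model that only ever predicts the majority class can still post an impressive score. A toy sketch with an assumed 5% positive rate (the rate is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
y = (rng.random(10_000) < 0.05).astype(int)  # only ~5% of items get clicked

# A "model" that always predicts the majority class (no click) scores ~95%
# accuracy while recommending nothing useful to anyone.
baseline_accuracy = (y == 0).mean()
print(f"majority-class baseline: {baseline_accuracy:.1%}")
```

A strong candidate will compare the 92% against exactly this kind of baseline, then pivot to business metrics like CTR and conversion.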

Q2

"Walk me through the last time a model you built in production failed. What happened?"

What this evaluates: Do they have production experience? Do they learn from failures? Red flag: 'It never failed' or vague answers.

Q3

"You're training a model that takes 6 hours per iteration. How do you approach hyperparameter tuning?"

What this evaluates: Do they think about iteration speed and resource constraints? Or do they default to GridSearch?
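A strong answer typically involves sampling the search space under an explicit compute budget rather than exhausting a grid. A rough stdlib sketch of that reasoning (the search space is illustrative, and `train_and_evaluate` is a hypothetical hook for the candidate's training loop):

```python
import random

# With a 6-hour training cycle, a full grid is unaffordable:
# 4 learning rates x 4 depths x 3 regularizers = 48 runs = 12 days of compute.
space = {
    "learning_rate": [0.3, 0.1, 0.03, 0.01],
    "max_depth": [4, 6, 8, 10],
    "reg_lambda": [0.1, 1.0, 10.0],
}

random.seed(0)
budget = 8  # ~2 days of compute instead of 12
trials = [
    {name: random.choice(values) for name, values in space.items()}
    for _ in range(budget)
]
for config in trials:
    print(config)  # train_and_evaluate(config) would go here
```

Candidates who go further and mention early stopping, training on a data subsample first, or successive-halving schemes are showing the resource awareness this question probes.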

Q4

"Your model's performance dropped 3% in production after it was working well in staging. What are the first three things you check?"

What this evaluates: Data drift? Model staleness? Distribution shift? This separates engineers who've debugged production systems from those who haven't.
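One of those first checks is usually a feature-distribution comparison between training-time and live data. A minimal sketch using a two-sample Kolmogorov-Smirnov test (synthetic data; the shift size and alert threshold are illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
reference = rng.normal(loc=0.0, size=5000)  # feature values at training time
live = rng.normal(loc=0.4, size=5000)       # same feature in production, shifted

# A small p-value says the live distribution no longer matches training data:
# a classic cause of silent accuracy drops.
stat, p_value = ks_2samp(reference, live)
if p_value < 0.01:
    print(f"distribution shift detected (KS statistic = {stat:.3f})")
```

Engineers who have debugged production systems tend to reach for checks like this (alongside verifying the feature pipeline and model version) before touching the model itself.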

Q5

"When would you NOT use deep learning, even if you had the data?"

What this evaluates: Do they know the trade-offs? (Interpretability, latency, data requirements, maintenance burden.) This is senior thinking.

These aren't trick questions. You're not looking for specific answers—you're listening for whether they think like someone who's shipped and maintained models. Senior engineers will give nuanced answers with trade-offs. They'll mention failure. They'll ask clarifying questions.

Green Flags vs. Red Flags Across All Stages

Universal Green Flags ✓

  • Talks openly about failures and what they learned
  • Asks about data quality before suggesting a model
  • Explains trade-offs (accuracy vs. latency, interpretability vs. performance)
  • Shows curiosity about production constraints
  • Mentions monitoring and debugging in production
  • Asks clarifying questions instead of jumping to answers
  • Can explain why they chose a particular approach
  • References both papers and production experience

Universal Red Flags ✗

  • Can't explain why they chose a model (just "it's state-of-the-art")
  • No production experience or only academic projects
  • Only talks about theory, no mention of debugging or iteration
  • Misses data leakage in code review
  • Hasn't shipped anything end-to-end
  • Can't discuss trade-offs or constraints
  • References only academic papers, no real-world examples
  • Defensive about feedback or past mistakes

How to Structure a Take-Home Assignment

If you use a take-home, structure it carefully. A poorly designed assignment will scare away good candidates or attract candidates who have unlimited time to over-engineer.

Time Scope: 3-4 hours maximum. If candidates spend 10+, they're over-engineering.
Dataset: Real or realistic. 5K-100K rows. Messy enough to require exploration, not toy-clean.
Problem: Clear business goal. Not 'predict X'—more like 'we want to reduce churn by identifying at-risk customers'.
Evaluation Criteria: Process > polish. Show iteration, not the final answer. Did they explore the data first?
Submission Format: Code + brief written explanation. No 50-page reports. We want to see thinking, not perfectionism.

What to evaluate in the submission:

  • ✓ Did they explore the data first or jump to modeling?
  • ✓ How did they justify their modeling choice?
  • ✓ Did they check for data leakage?
  • ✓ Did they mention limitations or next steps?
  • ✗ Is the code over-engineered with unnecessary abstractions?
  • ✗ Did they spend 10+ hours (a sign they're trying too hard to impress)?

Follow up with a 45-minute discussion about their submission. Ask them to walk you through their thinking. This is where you'll learn if they actually did the work or if they followed a template.

Reference Checks Specific to ML Roles

Most reference checks are generic. For ML roles, ask specific questions that reveal production judgment. Here are the questions we use:

1.

Can you describe a specific ML model this person shipped to production? What was the business problem it solved?

2.

Did they work effectively with data engineers and product managers, or were they siloed?

3.

Have you seen them over-engineer a solution? Give an example.

4.

When their model didn't meet expectations in production, how did they debug it?

5.

Would you hire them again for an ML role? Why or why not?

Red flag answers: "They never shipped anything," "I'm not sure what they actually built," "They seemed more interested in research than production."
Green flag answers: Specific examples of shipped work, mentions of debugging production issues, examples of working across teams.

Why This Matters: The Cost of a Bad ML Hire

A bad ML engineer hire can cost your team the better part of a year in lost productivity. Here's how that time disappears:

Months 1-2: Onboarding

Getting up to speed on your data, infrastructure, business goals.

Months 3-4: Building the Wrong Thing

They build a model that doesn't match business constraints (too slow, too expensive, uninterpretable).

Months 5-8: Debugging

Other engineers spend time debugging their work, explaining production requirements, or rebuilding from scratch.

Month 9: Exit

They leave because they're bored, or you move them off the project.

The vetting process is your only filter. And most companies don't have a good one, because the ML hiring playbook hasn't been standardized. LeetCode, behavioral, take-home—it's a template that worked for software engineers, not for ML.

Why VAMI Gets ML Hiring Right

This framework isn't theoretical. We use a version of it internally to vet every candidate before we present them to clients. The process is built around the same qualities that predict on-the-job performance, not interview performance.

Here's what we do differently: We focus on shipped work, not credentials. We run candidates through system design and code review before any formal interview. We ask about failures and production debugging. We reference-check for production judgment, not just "is this person nice?"

We're sharing this process because better-informed clients make better hiring decisions. If you have the right framework, you don't need us. But if you want to skip 6 months of screening and get access to engineers we've already vetted, that's what we're here for.


Frequently Asked Questions

Q: Why do traditional LeetCode interviews fail for ML engineers?

LeetCode tests algorithmic speed, not judgment. ML engineering is fundamentally about choosing the right model for the problem, understanding data constraints, and shipping something that works in production. An engineer who can solve a two-pointer problem in 15 minutes might spend months choosing between XGBoost and a neural network without ever validating the choice against a holdout set. LeetCode doesn't measure this core skill.

Q: What's the difference between evaluating a junior vs. senior ML engineer?

Senior ML engineers ask about data quality before suggesting a model. They've shipped enough to know that the vast majority of ML problems are data problems, not model problems. They can explain trade-offs in latency vs. accuracy. They know when a simple model is better than a complex one. Junior engineers often jump to the latest technique without this judgment. Our domain depth questions (Section 4) specifically separate these tiers.

Q: Should we use a take-home assignment for ML roles?

Yes, but structure it carefully. A take-home should be scoped to 3-4 hours, with a real (anonymized) dataset and a clear business problem. Evaluate the reasoning and iteration, not polish. Many candidates will spend 10+ hours trying to perfect an assignment—that's a red flag, not a green one. You want to see: hypothesis → experiment → learning → next step.

Q: How do we reference-check ML engineers differently?

Ask: (1) Can you give me an example of when they shipped a model to production? (2) Did they debug when accuracy didn't match expectations? (3) Did they work with non-technical stakeholders on requirements? (4) Have you seen them over-engineer a solution? The answers reveal whether they think like practitioners or academics.

Q: What if a candidate has no shipped ML projects?

Be cautious. A strong signal is a well-documented GitHub repo where they show iteration: experiments that failed, why, and what they learned. Academic papers and Kaggle competitions are weaker signals. If someone has only academic experience, probe hard during system design on whether they understand production constraints like latency, monitoring, and data drift.

Ready to Hire ML Engineers That Actually Ship?

Use this framework to vet your next hire. Or let VAMI handle the vetting and give you a short list of engineers who've already passed these tests.

