In 2024, asking "did the candidate use AI?" was a philosophical question. In 2026, it's an operational one. Most engineering managers we talk to have stopped asking whether candidates are using AI-assisted code generation and started asking how much and whether it matters.
This guide covers what AI-generated code actually looks like, which detection signals are reliable, how to handle the grey zone, and what policy we'd recommend if we were writing your hiring guidelines today.
Why "just ban AI" doesn't work
The instinct is understandable. If you can't trust that the candidate wrote the code, the signal is tainted. So you add a line to the brief: "Do not use AI tools."
The problem:
- You can't verify compliance. A candidate who ignores the rule and uses AI anyway submits code that looks no different; the rule only makes the usage harder to catch.
- You're filtering out candidates who use AI effectively as a tool, which is increasingly a job requirement.
- The best candidates are the ones most likely to resent the restriction and opt out.
The better framing: you don't care that they used AI. You care whether they can produce good work and explain every line of it. Detection is about calibrating that — not about penalising tool use.
What AI-generated code actually looks like
Forget the checklist of "too perfect formatting" or "no typos." Those aren't signals at this point; modern AI code is indistinguishable from tidy human code on superficial inspection. The real signals are statistical and behavioural.
Token-level perplexity
Language models generate code by predicting the next token. The tokens they produce are, by definition, low-perplexity — high probability given the context. Human-written code has higher variance in token probability: you'll see unusual variable names, domain-specific abbreviations, personal style quirks, and decisions that surprise the model.
This is the same technique used by tools like GPTZero, applied to code rather than prose. A consistently low perplexity score across a whole file is a strong signal that a model generated it, though not proof on its own. A mixed score (low in the boilerplate sections, higher in the business logic) is the normal pattern for AI-assisted human writing.
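As a rough illustration of the arithmetic, here is a minimal sketch of a perplexity calculation. It assumes you already have a natural-log probability for each token from some scoring model; which model and tokenizer to use is left open, and the function names and the window size of 64 are hypothetical choices for this example, not a description of how any particular detector works.

```python
import math
from typing import List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity is exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def windowed_perplexity(token_logprobs: Sequence[float], window: int = 64) -> List[float]:
    """Perplexity of consecutive non-overlapping windows.

    A flat, low profile suggests uniformly predictable code; a profile that
    rises in some regions matches the "mixed" pattern described above.
    The window size is an illustrative assumption.
    """
    return [
        perplexity(token_logprobs[i:i + window])
        for i in range(0, len(token_logprobs), window)
    ]
```

Scoring in windows rather than once per file is what makes it possible to tell "low everywhere" apart from "low in the boilerplate, higher in the business logic".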
Naming and structural entropy
AI models have strong priors about naming conventions. They produce userData, fetchResults, handleSubmit — common, predictable names consistent with whatever they've seen in training. Human code written under time pressure tends toward shorter, more contextual names: ud, res, doThing. Not better — but idiosyncratic.
Structural entropy is similar. AI tends to produce balanced function lengths, consistent indentation even in unusual places, and symmetric conditional branches. Human code has more variance — rushed code more so.
Neither of these is a smoking gun. A tidy, experienced engineer will produce low-entropy code naturally. The signal matters most when combined with others.
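To make the idea concrete, here is a minimal sketch that pulls two crude features out of a submission: the mean identifier length and the spread of function lengths. It assumes the submission is Python (it uses the standard-library ast module), and the feature names are hypothetical. Treat the output as raw material for comparison across submissions, not as a score with a known threshold.

```python
import ast
import statistics
from typing import Dict, List


def style_features(source: str) -> Dict[str, float]:
    """Extract simple naming and structure features from Python source."""
    tree = ast.parse(source)
    names: List[str] = []
    function_lengths: List[int] = []

    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            names.append(node.name)
            # Lines spanned by the function, including the def line.
            function_lengths.append(node.end_lineno - node.lineno + 1)
        elif isinstance(node, ast.Name):
            names.append(node.id)      # every variable read or write
        elif isinstance(node, ast.arg):
            names.append(node.arg)     # function parameters

    return {
        # Longer, descriptive names push this up; terse names pull it down.
        "mean_name_length": statistics.mean([len(n) for n in names]) if names else 0.0,
        # Near zero means very evenly sized functions.
        "function_length_stdev": statistics.pstdev(function_lengths) if function_lengths else 0.0,
    }
```

Both numbers are easy for a tidy engineer to produce naturally, which is why they only make sense alongside the other signals.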
Commit history
This is the most underused signal and the hardest to fake.
A human writing code over 3–4 hours produces a commit history that looks like work: small commits, some false starts, refactors that break things before they fix them, a flurry of commits near the end. An AI-assisted submission often has one or two large commits. The timestamps compress: an hour of wall-clock time produced 600 lines across 12 files, which is not how humans type.
Look for:
- Fewer than 3 commits for a 200+ line submission
- A single large commit that touches every file at once
- Commit messages that look auto-generated ("feat: implement all requirements")
- Commit timestamps that don't match the candidate's own account of when they did the work (for example, a claim of business-hours work against a history with no commits in business hours)
You don't need tooling for this — git log --stat tells you in 30 seconds. But at scale, you do want automation.
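If you do automate it, the first two checks and the timestamp-compression check can be sketched roughly as follows. The Commit record, the flag names, and the default thresholds are all hypothetical. The defaults mirror the numbers mentioned above (3 commits, 200 lines, 600 lines in an hour) and should be tuned to your own assignments. The sketch assumes the commit data has already been extracted from the repository.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Commit:
    timestamp: datetime
    files_changed: int
    lines_added: int
    message: str


def commit_flags(
    commits: List[Commit],
    total_files: int,
    min_lines: int = 200,
    min_commits: int = 3,
    max_lines_per_hour: float = 600.0,
) -> List[str]:
    """Return the names of the commit-history patterns that were triggered."""
    flags: List[str] = []
    total_lines = sum(c.lines_added for c in commits)

    # Large submission delivered in very few commits.
    if total_lines >= min_lines and len(commits) < min_commits:
        flags.append("few_commits")

    # One commit that touches every file in the submission.
    if total_files > 1 and any(c.files_changed >= total_files for c in commits):
        flags.append("single_commit_touches_every_file")

    # Throughput between first and last commit that is implausibly high.
    if len(commits) >= 2:
        times = sorted(c.timestamp for c in commits)
        hours = (times[-1] - times[0]).total_seconds() / 3600
        if hours > 0 and total_lines / hours > max_lines_per_hour:
            flags.append("compressed_timestamps")

    return flags
```

The fields on Commit could be filled from git log --numstat output; that parsing step is omitted here. A flag is a prompt for a debrief question, not a conclusion.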
The "explain it" test
The most reliable signal of all: ask them about their code in the debrief.
Not "explain this function" (they can re-read it and explain it). Ask about decisions: "Why did you use a map here instead of a list?" or "I notice you didn't add input validation on this endpoint — was that intentional?"
A candidate who wrote the code will have opinions, even wrong ones. A candidate who prompted AI to write the code will either have no opinion or will answer with generic best-practice language that doesn't reference their actual implementation.
This is the human layer that no automated tool replaces. Use it.
How to think about the AI score
Whether you're running automated detection or scoring manually, you'll end up with something like a 0–100 likelihood score. Here's how we'd bucket it (a score that lands exactly on a boundary goes in the higher bucket):
| Score | What it means | Action |
|---|---|---|
| 0–40 | Almost certainly human-written | No debrief adjustment needed |
| 40–60 | Human code, likely with AI assistance for boilerplate | Note it, ask one general question about tooling philosophy |
| 60–80 | Heavily AI-assisted; candidate probably wrote the architecture, AI wrote the implementation | Probe the debrief on specific decisions and edge cases |
| 80–100 | Near-total AI generation; candidate may not be able to explain key sections | Ask them to walk through a specific function live; this is your debrief focus |
The 60–80 range is where reasonable people disagree. A senior engineer who uses Copilot heavily and produces clean, correct, well-explained code is probably a better hire than a mid-level engineer who wrote everything by hand but shipped something brittle. The score is an input to the debrief question, not a verdict.
The 80+ range is where most teams set their rejection threshold. We'd agree with that, with one caveat: never reject on the score alone, and always give the candidate a chance to explain first. We've seen submissions at 85 where the candidate disclosed upfront ("I used Cursor to generate the scaffolding") and could explain every line in detail. Context matters.
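If you want the buckets in a review script, a minimal sketch of the mapping might look like this. The function name is hypothetical, and because the table's ranges share their endpoints, the sketch assumes a boundary score (40, 60, 80) belongs to the higher bucket.

```python
from typing import Tuple


def debrief_action(score: float) -> Tuple[str, str]:
    """Map a 0-100 AI likelihood score to (interpretation, debrief action)."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score < 40:
        return ("almost certainly human-written", "no debrief adjustment needed")
    if score < 60:
        return ("human code, likely AI-assisted boilerplate",
                "ask one general question about tooling philosophy")
    if score < 80:
        return ("heavily AI-assisted",
                "probe specific decisions and edge cases")
    return ("near-total AI generation",
            "ask for a live walk-through of a specific function")
```

The function returns a debrief action rather than a hire or no-hire decision, in keeping with the score being an input to the conversation.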
What to actually do at each stage
In the brief
Add one sentence: "Submissions are automatically checked for AI-generated code. We don't automatically reject on this signal, but it informs our debrief questions."
This changes behaviour more than any technical detection does. Candidates who were planning to submit pure AI output will either opt out (fine) or increase their own involvement to be able to explain the code (also fine).
During review
Run automated detection as part of your standard review pipeline. If you're doing this manually:
- Run git log --stat and look for the commit patterns above.
- Skim the functions that "feel" too clean and run them through a perplexity checker.
- Note the sections you want to ask about in the debrief.
Don't make a hire/no-hire call at this stage based on AI detection alone. Make it in the debrief.
In the debrief
Open with an easy question about the implementation, then move to a specific decision point that requires genuine understanding:
- "I noticed you chose [approach X] for the data layer — what alternatives did you consider?"
- "This part of your code handles [edge case] but this other part doesn't — was that intentional?"
- "If you had another hour, what would you add or change?"
The last question is the most informative. A candidate who wrote the code will have a specific, prioritised answer. A candidate who prompted for the code will give a generic one.
Policy recommendation
If you're writing your AI-use policy for take-homes today, here's what we'd put in writing:
Candidates may use AI coding tools (Copilot, Cursor, Claude, etc.) as they would on the job. All submissions are automatically scanned for AI-generated code. A submission will be declined on that basis if its AI likelihood score is above 80 and the candidate cannot explain their implementation decisions in the debrief.
That policy:
- Doesn't ban AI use (unenforceable and alienating)
- Creates a transparent, defensible rejection criterion
- Shifts the bar from "did you write it" to "can you work with it" — which is the bar that actually matters
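The key property of that rule is that both conditions must hold before a submission is declined. A minimal sketch, with a hypothetical function name and with "above 80" read as strictly greater than 80:

```python
def should_decline(ai_score: float, explained_decisions: bool, threshold: float = 80.0) -> bool:
    """Decline only when the score is above the threshold AND the debrief failed.

    explained_decisions is the interviewer's judgment from the debrief;
    a high score alone never triggers a decline.
    """
    return ai_score > threshold and not explained_decisions
```

For example, a candidate scoring 85 who explains their implementation decisions is not declined under this rule, which matches the disclosed-Cursor case described earlier.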
Frequently asked questions
Can I detect AI use if the candidate used it for a small part of the submission?
Not reliably. If a candidate used AI to generate the boilerplate and wrote the business logic themselves, the perplexity signal is mixed and the commit history will look normal. That's fine — AI-generated boilerplate is not a red flag. The detection is most accurate for submissions where AI wrote the substantive parts.
Are there false positives? What if a tidy engineer gets flagged?
Yes. Any statistical signal has false positives. A very experienced engineer who writes clean, conventional code will sometimes score in the 60–80 range. This is why the debrief is the final gate, not the score. The score is a prior; the conversation updates it.
What if the candidate used AI but was transparent about it?
Disclosure is a positive signal. A candidate who says "I used Cursor for the initial scaffolding and then refactored it" is showing tool literacy and honesty. Weight that in the debrief. They should still be able to explain every decision; the bar for "explain it" is the same whether they used AI or not.
We're a small team. Is this worth the overhead?
If you're reviewing fewer than 10 submissions a month, the "explain it" debrief question is enough. You don't need automated detection. If you're at 10+, the overhead of manual detection adds up fast — that's where automation starts to pay off.
Does this apply to coding screens (live interviews) as well?
A live coding screen in a shared editor has a different problem: the candidate can use AI in browser tabs you can't see. The only reliable countermeasure is the same kind of probing you'd do in a debrief: ask them to extend what they just wrote, add error handling to a specific function, or explain a decision they made three minutes ago. The signal is the same: genuine work produces instant, specific answers.
CodeVerdict runs automated AI-likelihood scoring on every take-home submission as part of the standard analysis — no setup required. Try it on your next assignment.