In technical hiring, the integrity of a take-home assignment is paramount. However, as solutions to popular company assignments leak onto GitHub and forums, recruiters are increasingly facing "borrowed" code.
Spotting this manually is a nightmare. A candidate might rename all variables from camelCase to snake_case, move functions around, or change indentation—all while keeping the underlying logic identical. Simple string-matching tools fail instantly here.
The Science of Structural Similarity
To build a robust detection engine, we look past the surface level (syntax) and focus on the structure of the logic. We use a combination of several techniques:
1. Tokenization and Normalization
First, we strip away the "noise." Whitespace, comments, and variable names are irrelevant to the core logic. We convert the source code into a stream of tokens. For example, const x = 5; and let counter = 5; both become a sequence of [DECLARATION, IDENTIFIER, ASSIGNMENT, CONSTANT].
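The normalization step can be sketched roughly as follows. This is a minimal illustration, not the production tokenizer: the regular expression, the `normalize` function name, and the token classes beyond the four named above are assumptions, and a real system would use a full language grammar rather than a regex.

```python
import re

# Hypothetical keyword tables for a JavaScript-like language (assumed for this sketch).
DECLARATIONS = {"const", "let", "var"}
KEYWORDS = {"function", "return", "if", "else", "for", "while"}

# Alternatives are tried in order: comments, literals, words, multi-character
# operators, plain assignment, skippable characters, then any other character.
TOKEN_PATTERN = re.compile(
    r"""
    (?P<COMMENT>//[^\n]*|/\*.*?\*/)
  | (?P<CONSTANT>\d+(?:\.\d+)?|"[^"\n]*"|'[^'\n]*')
  | (?P<WORD>[A-Za-z_$][A-Za-z0-9_$]*)
  | (?P<OPERATOR>===|!==|==|!=|<=|>=|=>)
  | (?P<ASSIGNMENT>=)
  | (?P<SKIP>\s+|;)
  | (?P<PUNCT>.)
    """,
    re.VERBOSE | re.DOTALL,
)


def normalize(source: str) -> list[str]:
    """Turn source text into a stream of token classes, dropping noise."""
    tokens = []
    for match in TOKEN_PATTERN.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind in ("COMMENT", "SKIP"):
            continue  # whitespace, semicolons, and comments carry no logic
        if kind == "WORD":
            if text in DECLARATIONS:
                tokens.append("DECLARATION")
            elif text in KEYWORDS:
                tokens.append(text.upper())  # keep control-flow keywords distinct
            else:
                tokens.append("IDENTIFIER")  # every variable name collapses to one class
        elif kind in ("OPERATOR", "PUNCT"):
            tokens.append(text)  # operators and brackets are kept verbatim
        else:
            tokens.append(kind)  # CONSTANT or ASSIGNMENT
    return tokens


print(normalize("const x = 5;"))
print(normalize("let counter = 5; // initial value"))
```

Both calls print the same four-token sequence, which is why renaming variables or adding comments does not change what later stages see.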
2. The Winnowing Algorithm
This is the heart of our system. Winnowing is a local fingerprinting algorithm used to detect document similarity.
Instead of hashing the entire file, we break the token stream into short overlapping sequences of k consecutive tokens (k-grams) and hash each one. We then slide a window over those hashes and keep the smallest hash in each window as a "representative"; the selected hashes are the file's fingerprints. This allows us to detect partial matches. If a candidate copies a single complex function but writes the rest of the file themselves, we will still catch that specific function.
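A compact sketch of winnowing over a token stream is shown below. The values `k=5` and `window=4`, the use of SHA-1 truncated to 64 bits, and the function names are assumptions chosen for illustration; the article does not state which parameters or hash function the real engine uses.

```python
import hashlib


def kgram_hashes(tokens: list[str], k: int) -> list[int]:
    """Hash every run of k consecutive tokens (one hash per k-gram)."""
    hashes = []
    for i in range(len(tokens) - k + 1):
        gram = "\x1f".join(tokens[i:i + k])  # unit separator avoids ambiguous joins
        digest = hashlib.sha1(gram.encode("utf-8")).digest()
        hashes.append(int.from_bytes(digest[:8], "big"))
    return hashes


def winnow(tokens: list[str], k: int = 5, window: int = 4) -> set[tuple[int, int]]:
    """Return fingerprints as (hash, position of the k-gram in the token stream)."""
    hashes = kgram_hashes(tokens, k)
    if not hashes:
        return set()  # fewer than k tokens: nothing to fingerprint
    window = min(window, len(hashes))
    fingerprints = set()
    for start in range(len(hashes) - window + 1):
        chunk = hashes[start:start + window]
        smallest = min(chunk)
        # On ties, take the rightmost minimum, as in the usual winnowing description.
        offset = window - 1 - chunk[::-1].index(smallest)
        fingerprints.add((smallest, start + offset))
    return fingerprints
```

With these settings, any run of at least `window + k - 1` identical tokens shared by two files yields at least one common fingerprint hash, because some full window of k-gram hashes lies entirely inside the shared run and both files pick the same minimum from it. This naive version rescans each window; a production implementation would likely track the minimum incrementally.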
3. Comparison against the Corpus
When a new assignment is submitted, we compare its fingerprints against three main sources:
- Our Template Database: We know what the starter code for your assignment looks like. We ignore matches that come from your own template.
- Peer Submissions: We check against all previous submissions for that same assignment in your account.
- Global Knowledge: We maintain a database of common "leak" patterns from public repositories.
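One way to picture the lookup described in this list is an inverted index from fingerprint hash to the sources that contain it. The sketch below is an assumption about how such a comparison could be organized, not a description of the actual storage layer; the source identifiers (`"peer:1"`, `"public:leak"`) and function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(corpus: dict[str, set[int]]) -> dict[int, set[str]]:
    """Map each fingerprint hash to the ids of the sources that contain it."""
    index = defaultdict(set)
    for source_id, fingerprints in corpus.items():
        for fingerprint in fingerprints:
            index[fingerprint].add(source_id)
    return index


def shared_fingerprints(
    submission: set[int],
    template: set[int],
    index: dict[int, set[str]],
) -> Counter:
    """Count fingerprints shared with each source, ignoring starter-code matches."""
    candidate_written = submission - template  # template matches are expected
    hits = Counter()
    for fingerprint in candidate_written:
        for source_id in index.get(fingerprint, ()):
            hits[source_id] += 1
    return hits


# Hypothetical data: small integers stand in for real fingerprint hashes.
corpus = {"peer:1": {1, 2, 3, 10}, "public:leak": {7, 8, 9}}
hits = shared_fingerprints({1, 2, 3, 7, 8, 20}, template={1, 2}, index=build_index(corpus))
print(hits.most_common())  # [('public:leak', 2), ('peer:1', 1)]
```

Subtracting the template set first means a submission is never penalized for code it was handed as a starting point.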
Why 'Fuzzy' Matching is Necessary
Real-world plagiarism is rarely a 1:1 copy. Candidates might:
- Add extra boilerplate.
- Change loop types (e.g., rewriting a for loop as a while loop).
- Extract logic into helper functions.
Our structural comparison is designed to be "fuzzy." We calculate a Similarity Score that represents the percentage of structural fingerprints shared between two files. A score of 90% or higher is a strong red flag, while a score of 30-50% may only indicate that two developers used the same standard library pattern.
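As a rough illustration of the score, the sketch below computes the share of fingerprints two files have in common. The article does not say exactly how the percentage is defined, so the Jaccard-style formula (shared fingerprints divided by all distinct fingerprints across both files) is an assumption, as are the function names. Only the two bands mentioned above are labeled; scores outside them are left unclassified rather than guessed at.

```python
def similarity_score(a: set[int], b: set[int]) -> float:
    """Percentage of distinct fingerprints that appear in both files."""
    if not a and not b:
        return 0.0  # two empty files share nothing meaningful
    return 100.0 * len(a & b) / len(a | b)


def triage(score: float) -> str:
    """Label a score using only the two bands described in the text."""
    if score >= 90:
        return "red flag"
    if 30 <= score <= 50:
        return "possibly a shared standard pattern"
    return "unclassified"


# Hypothetical fingerprint sets; small integers stand in for real hashes.
print(similarity_score({1, 2, 3, 4}, {1, 2, 3, 4}))  # 100.0
print(similarity_score({1, 2, 3}, {3, 4, 5}))        # 20.0
print(triage(95.0))                                  # red flag
```

A symmetric measure like this will understate the case where a short copied function sits inside a long original file; measuring overlap relative to the smaller file is a common alternative when partial copying is the main concern.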
Fairness and False Positives
We believe detection tools should aid human judgment, not replace it. Our reports don't just say "Plagiarized"—they show you exactly which lines of code matched and where they matched from. This allows your team to make an informed decision based on the context.
For example, if two candidates both use a very common useEffect pattern from the official React documentation, our system might flag it, but a human reviewer can easily see it's just standard practice.
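To show which lines matched, each fingerprint needs to remember where it came from. The sketch below assumes fingerprints are stored as (hash, line number) pairs, which is a hypothetical representation chosen for this example, and pairs up lines in the submission with lines in the matched source.

```python
from collections import defaultdict


def matching_lines(
    submission: list[tuple[int, int]],
    source: list[tuple[int, int]],
) -> list[tuple[int, int]]:
    """Return (submission_line, source_line) pairs that share a fingerprint hash."""
    source_lines = defaultdict(list)
    for fingerprint, line in source:
        source_lines[fingerprint].append(line)

    report = []
    # Walk the submission top to bottom so the report reads in file order.
    for fingerprint, line in sorted(submission, key=lambda pair: pair[1]):
        for source_line in source_lines.get(fingerprint, []):
            report.append((line, source_line))
    return report


# Hypothetical fingerprints: (hash, line number where the k-gram starts).
submission = [(11, 3), (22, 4), (33, 9)]
source = [(22, 40), (11, 39), (99, 1)]
print(matching_lines(submission, source))  # [(3, 39), (4, 40)]
```

A reviewer can then open both files side by side at the reported lines and judge whether the overlap is a copied function or a widely used idiom.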
Summary
Plagiarism detection in 2026 requires moving beyond text and into the realm of algorithmic fingerprinting. By using structural analysis, we ensure that your hiring process remains fair and that your team's time is spent interviewing candidates who actually wrote their code.
Curious about how your current assignments score? Upload a sample submission to see the structural analysis in action.