2 min read · CodeVerdict Team

Engineering Plagiarism Detection: How We Spot 'Borrowed' Code

Detecting plagiarism in source code is harder than detecting it in prose. Renaming variables or reordering functions is enough to fool simple scanners. Here's how our detection engine works.

In technical hiring, the integrity of a take-home assignment is paramount. However, as solutions to popular company assignments leak onto GitHub and forums, recruiters are increasingly facing "borrowed" code.

Spotting this manually is a nightmare. A candidate might rename all variables from camelCase to snake_case, move functions around, or change indentation—all while keeping the underlying logic identical. Simple string-matching tools fail instantly here.

The Science of Structural Similarity

To build a robust detection engine, we look past the surface text (names, formatting, ordering) and focus on the structure of the logic. We combine three steps:

1. Tokenization and Normalization

First, we strip away the "noise." Whitespace, comments, and the specific names chosen for variables are irrelevant to the core logic, so we convert the source code into a stream of normalized tokens. For example, const x = 5; and let counter = 5; both become the same sequence: [DECLARATION, IDENTIFIER, ASSIGNMENT, CONSTANT].
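To make the normalization step concrete, here is a minimal Python sketch. It is not our production tokenizer: the token grammar, the KEYWORDS table, and the normalize function are hypothetical names invented for illustration, and the sketch assumes that statement-ending semicolons can be treated as noise alongside whitespace and comments.

```python
import re

# Hypothetical, deliberately tiny token grammar for JavaScript-like code.
KEYWORDS = {"const": "DECLARATION", "let": "DECLARATION", "var": "DECLARATION"}

TOKEN_RE = re.compile(
    r"""
    (?P<COMMENT>//[^\n]*|/\*.*?\*/)                   # line and block comments
  | (?P<CONSTANT>\d+(?:\.\d+)?|"[^"\n]*"|'[^'\n]*')   # number and string literals
  | (?P<WORD>[A-Za-z_$][A-Za-z0-9_$]*)                # keywords and identifiers
  | (?P<OPERATOR>===|!==|==|!=|<=|>=|=>)              # multi-character operators
  | (?P<ASSIGNMENT>=)                                 # plain assignment
  | (?P<SKIP>[\s;]+)                                  # whitespace and terminators
  | (?P<OTHER>.)                                      # any other single character
    """,
    re.VERBOSE | re.DOTALL,
)


def normalize(source: str) -> list[str]:
    """Turn source text into a stream of normalized token kinds."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind in ("COMMENT", "SKIP"):
            continue  # noise: dropped entirely (assumption: ';' counts as noise)
        if kind == "WORD":
            # Every identifier collapses to the same token, whatever its name.
            tokens.append(KEYWORDS.get(text, "IDENTIFIER"))
        elif kind in ("OPERATOR", "OTHER"):
            tokens.append(text)  # operators and punctuation are kept verbatim
        else:
            tokens.append(kind)  # CONSTANT or ASSIGNMENT
    return tokens


print(normalize("const x = 5;"))
print(normalize("let counter = 5; // initial value"))
```

Both calls print ['DECLARATION', 'IDENTIFIER', 'ASSIGNMENT', 'CONSTANT'], so a rename from x to counter leaves no trace in the token stream.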

2. The Winnowing Algorithm

This is the heart of our system. Winnowing is a local fingerprinting algorithm used to detect document similarity.

Instead of hashing the entire file, we break the token stream into short overlapping runs of k tokens (k-grams) and hash each one. We then slide a window over those hashes and keep the smallest hash in each window, which gives us a "representative" subset: the fingerprints. Because fingerprints are chosen locally, we can detect partial matches. If a candidate copies a single complex function but writes the rest of the file themselves, we will still catch that specific function.
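The fingerprint selection described above can be sketched in a few lines of Python. This is an illustration of the general winnowing technique, not our engine's code: the function names, the use of CRC32 as the k-gram hash, and the values of k and the window size are all assumptions made for the example.

```python
import zlib


def kgram_hashes(tokens, k):
    """Hash every run of k consecutive tokens (CRC32 is an arbitrary choice)."""
    return [
        zlib.crc32(" ".join(tokens[i:i + k]).encode("utf-8"))
        for i in range(len(tokens) - k + 1)
    ]


def winnow(hashes, window):
    """Select fingerprints: the minimum hash of each window of hashes.

    On ties the rightmost minimum wins, and a hash at a given position is
    recorded only once even if it is the minimum of several windows.
    Returns (hash, position) pairs.
    """
    if not hashes:
        return []
    window = min(window, len(hashes))  # short inputs form a single window
    fingerprints = []
    last_position = -1
    for start in range(len(hashes) - window + 1):
        best = start
        for j in range(start, start + window):
            if hashes[j] <= hashes[best]:  # '<=' keeps the rightmost minimum
                best = j
        if best != last_position:
            fingerprints.append((hashes[best], best))
            last_position = best
    return fingerprints


# Toy hash sequence, standing in for the output of kgram_hashes().
sample = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
print(winnow(sample, window=4))
```

With a window of 4, the toy sequence yields [(17, 3), (17, 6), (8, 8), (39, 11), (17, 15)]: five fingerprints stand in for seventeen hashes, and any copied run long enough to fill a window is guaranteed to contribute at least one of them.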

3. Comparison against the Corpus

When a new assignment is submitted, we compare its fingerprints against three main sources:

  • Our Template Database: We know what the starter code for your assignment looks like. We ignore matches that come from your own template.
  • Peer Submissions: We check against all previous submissions for that same assignment in your account.
  • Global Knowledge: We maintain a database of common "leak" patterns from public repositories.
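One way to picture the comparison across these sources is an inverted index from fingerprint to the sources that contain it. The Python sketch below is a simplified illustration under that assumption; the function names, the source ids, and the integer fingerprints are all made up for the example.

```python
from collections import Counter, defaultdict


def build_index(corpus):
    """Map each fingerprint to the set of source ids that contain it."""
    index = defaultdict(set)
    for source_id, fingerprints in corpus.items():
        for fingerprint in fingerprints:
            index[fingerprint].add(source_id)
    return index


def shared_fingerprints(submission, template, index):
    """Count fingerprints shared with each source, ignoring template matches."""
    own = set(submission) - set(template)  # starter code never counts
    counts = Counter()
    for fingerprint in own:
        for source_id in index.get(fingerprint, ()):
            counts[source_id] += 1
    return counts


# Hypothetical data: fingerprints are shown as small integers for readability.
corpus = {
    "peer:alice": {11, 22, 33, 44},
    "leak:public-repo": {33, 44, 55},
}
template = {11}
submission = {11, 22, 33, 44, 99}

index = build_index(corpus)
print(shared_fingerprints(submission, template, index))
```

Here the submission shares three non-template fingerprints with peer:alice and two with leak:public-repo. Fingerprint 11 appears in both the submission and Alice's file, but it comes from the starter template, so it is not counted.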

Why 'Fuzzy' Matching Is Necessary

Real-world plagiarism is rarely a 1:1 copy. Candidates might:

  • Add extra boilerplate.
  • Change loop types (e.g., for to while).
  • Extract logic into helper functions.

Our structural comparison is designed to be "fuzzy." We calculate a Similarity Score that represents the percentage of structural fingerprints shared between two files. A score of 90%+ is a strong red flag that warrants a close look, while 30-50% might indicate that two developers simply used the same standard library pattern.
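As a rough illustration of the score, the Python sketch below computes the share of fingerprints two files have in common. The exact formula is an assumption: it uses Jaccard similarity (shared fingerprints divided by all distinct fingerprints across both files), which is one common way to express "percentage shared." The triage labels cover only the two bands mentioned above, and the fallback label is a placeholder.

```python
def similarity_score(a, b):
    """Percentage of fingerprints shared by two files (Jaccard, an assumption)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty files share nothing worth reporting
    return 100.0 * len(a & b) / len(union)


def triage(score):
    """Map a score to the bands named above; other ranges get a placeholder."""
    if score >= 90:
        return "red flag"
    if 30 <= score <= 50:
        return "possibly a shared standard pattern"
    return "review in context"  # placeholder label, not an official band


near_copy = similarity_score(set(range(20)), set(range(1, 20)))
partial_overlap = similarity_score(set(range(10)), set(range(5, 15)))

print(near_copy, triage(near_copy))
print(round(partial_overlap, 1), triage(partial_overlap))
```

The first pair shares 19 of 20 distinct fingerprints and scores 95.0, landing in the red-flag band. The second shares 5 of 15 and scores about 33.3, which falls in the range where a common idiom is a plausible explanation.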

Fairness and False Positives

We believe detection tools should aid human judgment, not replace it. Our reports don't just say "Plagiarized." They show you exactly which lines of code matched and which source they matched. This allows your team to make an informed decision based on the context.

For example, if two candidates both use a very common useEffect pattern from the official React documentation, our system might flag it, but a human reviewer can easily see it's just standard practice.
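A line-level report like the one described above is possible when each fingerprint remembers where it came from. The Python sketch below assumes every fingerprint carries the line range of the k-gram that produced it. The Fingerprint class, the match_report function, and the report fields are hypothetical and do not reflect our actual report format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    value: int       # k-gram hash selected as a fingerprint
    start_line: int  # first source line covered by the k-gram
    end_line: int    # last source line covered by the k-gram


def match_report(submission, sources):
    """List which submission lines matched which lines of which source."""
    by_value = {}
    for source_id, fingerprints in sources.items():
        for fingerprint in fingerprints:
            by_value.setdefault(fingerprint.value, []).append((source_id, fingerprint))

    report = []
    for fingerprint in submission:
        for source_id, other in by_value.get(fingerprint.value, []):
            report.append({
                "submission_lines": (fingerprint.start_line, fingerprint.end_line),
                "source": source_id,
                "source_lines": (other.start_line, other.end_line),
            })
    return report


# Hypothetical fingerprints for one submission and one earlier peer submission.
submission = [Fingerprint(17, 3, 4), Fingerprint(8, 10, 12)]
sources = {"peer:bob": [Fingerprint(8, 40, 42), Fingerprint(39, 50, 51)]}

for entry in match_report(submission, sources):
    print(entry)
```

The single entry printed says that lines 10 to 12 of the submission match lines 40 to 42 of peer:bob, which is the kind of evidence a reviewer needs to decide whether the overlap is a standard pattern or a copy.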

Summary

Plagiarism detection in 2026 requires moving beyond text and into the realm of algorithmic fingerprinting. By using structural analysis, we help keep your hiring process fair and your team's time focused on interviewing candidates who actually wrote their code.


Curious about how your current assignments score? Upload a sample submission to see the structural analysis in action.