Do you detect only exact similarity? What if variable names, formatting is chang...

nunobrito · on Feb 10, 2016

Different algorithms are used.

1) Binary comparison. Without knowing what type of file we are matching, we compare it against other files and evaluate whether the binary contents are similar (or, preferably, 100% equal).

2) Snippet matching. For mainstream languages (C, Java, JavaScript, Python, etc.) we transform the code into anonymized blocks that ignore variable names, formatting and comment blocks. The blocks are then compared for similarity. Anything down to 80% similarity still qualifies as a match.
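To make the two steps above concrete, here is a minimal sketch in Python. It is not the triplecheck implementation: the function names, the use of `difflib` for similarity, and the `ID`/`NUM`/`STR` placeholders are all assumptions made for illustration. Only the 80% threshold comes from the description above.

```python
import hashlib
import io
import keyword
import tokenize
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.8  # the 80% figure mentioned above

# Token types that only carry formatting or comments (assumed irrelevant here).
SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def binary_similarity(a: bytes, b: bytes) -> float:
    """Step 1: type-agnostic comparison of raw bytes."""
    if hashlib.sha256(a).digest() == hashlib.sha256(b).digest():
        return 1.0  # identical content, the preferred case
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def anonymize(source: str) -> list:
    """Step 2: reduce Python source to tokens without names, layout or comments."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIP:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("ID")    # every identifier looks the same
        elif tok.type == tokenize.NUMBER:
            tokens.append("NUM")
        elif tok.type == tokenize.STRING:
            tokens.append("STR")
        else:
            tokens.append(tok.string)  # keywords, operators, punctuation
    return tokens


def snippet_similarity(a: str, b: str) -> float:
    """Similarity of two snippets after anonymization."""
    return SequenceMatcher(None, anonymize(a), anonymize(b), autojunk=False).ratio()


def is_match(a: str, b: str) -> bool:
    return snippet_similarity(a, b) >= MATCH_THRESHOLD
```

With this sketch, `def add(x, y): return x + y` and a copy with renamed variables, different spacing and an extra comment reduce to the same token list, so they compare as identical.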

To provide context, we have the concept of code diversity, meaning that a given match needs to present a relatively high number of different logical instructions in order to qualify as a match. For example, multiple IF statements will not qualify unless they contain other code within. If you change the order or add/remove code, we are still robust enough to detect the match.
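As a rough illustration of the diversity idea, the sketch below counts how many distinct kinds of statement a Python snippet contains. Treating "different logical instructions" as distinct statement types is an assumption, and so are the function names and the `MIN_DIVERSITY` value; the real measure may well be different.

```python
import ast

MIN_DIVERSITY = 4  # hypothetical cut-off, chosen only for this example


def diversity(source: str) -> int:
    """Count the distinct statement kinds (If, For, Assign, ...) in a snippet."""
    tree = ast.parse(source)
    kinds = {type(node).__name__
             for node in ast.walk(tree)
             if isinstance(node, ast.stmt)}
    return len(kinds)


def is_diverse_enough(source: str, minimum: int = MIN_DIVERSITY) -> bool:
    """A candidate match only counts if its code is varied enough."""
    return diversity(source) >= minimum
```

A run of bare IF statements yields only a couple of statement kinds and is rejected, while a small function with a loop, a condition and an accumulation passes.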

For special cases where there is a known malicious intention of hiding the code, I cross-match different algorithms and look specifically at variable names and comments inside the code. In such cases a manual inspection is done by an expert, and it becomes truly difficult for a developer to escape the detection of non-original code.

In fact, if someone is indeed able to hide code from triplecheck, then they have reached a level of sophistication at which no normal third-party developer would be capable of (easily) detecting the plagiarism. In our experience there have been rare cases where only with new techniques did we notice that a given company had managed to hide non-original code from our tooling.

In either case, we live and learn from such examples, and with each new iteration of the tooling it gets more difficult to evade (non-)originality detection.