microsoft / near-duplicate-code-detector

Licence: MIT License
A simple tool for detecting near-duplicate source code

Near-Duplicate Code Detector

This cross-platform sample tool, maintained by the Deep Program Understanding group in Microsoft Research, Cambridge, UK, detects exact and near duplicates of code. It was created to deduplicate code corpora for research purposes.

Requirements: .NET Core 2.1 or higher. For parsing code, an appropriate runtime is also required for each of the languages that need to be tokenized.

To run near-duplicate detection:

$ dotnet run /path/to/DuplicateCodeDetector.csproj [options] --dir=<folder> <output-file-prefix>

This reads all the .gz files in <folder> and writes <output-file-prefix>.json, which contains the groups of detected duplicates. Run with --help for more options.
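Once a run finishes, the groups can be used to thin a corpus. The sketch below assumes, since the output format is not spelled out here, that the output file is a JSON array of groups and that each group is an array of the identifiers supplied in the input. `load_groups` and `files_to_drop` are hypothetical helper names, not part of the tool. Inspect the output of an actual run before relying on this.

```python
import json


def load_groups(path):
    """Load duplicate groups from the detector's output file.

    ASSUMPTION: the file holds a JSON array of arrays of identifiers.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def files_to_drop(groups):
    """Keep one representative per group and return the other members as a set."""
    drop = set()
    for group in groups:
        members = sorted(group)  # sort so the kept representative is deterministic
        drop.update(members[1:])  # everything after the first member is redundant
    return drop


if __name__ == "__main__":
    # Hypothetical groups, shaped like the assumed output format.
    example = [["src/b.py", "src/a.py", "vendor/a_copy.py"], ["lib/x.py", "lib/y.py"]]
    print(sorted(files_to_drop(example)))
```

Keeping the lexicographically smallest identifier is an arbitrary choice; any stable rule (shortest path, oldest file) would serve equally well.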

Input Data

The input data should be one or more .jsonl.gz files. These are compressed JSONL files where each line has a single JSON entry of the form

{
    "filename": "unique identifier of file, such as a path or a unique id",
    "tokens" : ["list", "of", "tokens", "in", "file"]
}

Alternative formats can be accepted by providing the --tokens-field and --id-fields options.
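As a minimal sketch of producing input in the default format described above, the following Python writes one JSON object per line into a gzip-compressed file using only the standard library. The helper names `write_corpus` and `read_corpus` and the sample file name are illustrative assumptions, not part of the tool.

```python
import gzip
import json


def write_corpus(entries, out_path):
    """Write (identifier, tokens) pairs as gzip-compressed JSONL, one object per line."""
    with gzip.open(out_path, "wt", encoding="utf-8") as f:
        for identifier, tokens in entries:
            record = {"filename": identifier, "tokens": list(tokens)}
            f.write(json.dumps(record) + "\n")


def read_corpus(in_path):
    """Read the entries back, mainly to sanity-check a file before a run."""
    with gzip.open(in_path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    # Hypothetical sample data; the identifiers only need to be unique.
    sample = [
        ("src/a.py", ["def", "f", "(", ")", ":", "return", "1"]),
        ("src/b.py", ["def", "g", "(", ")", ":", "return", "1"]),
    ]
    write_corpus(sample, "sample.jsonl.gz")
    print(len(read_corpus("sample.jsonl.gz")))
```

Placing the resulting file in the folder passed via --dir makes it available to the detector alongside any other .gz files there.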

The tokenizers folder in this repository contains tokenizers for C#, F#, Java, JavaScript and Python. Please feel free to contribute tokenizers for other languages too.
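To illustrate what a tokenizer has to produce, here is a rough sketch that turns Python source into an entry of the default form using the standard-library `tokenize` module. It is not the tokenizer shipped in the tokenizers folder; the function names and the decision to drop comments and layout tokens are assumptions made for this example.

```python
import io
import tokenize

# Token types dropped in this sketch: comments and layout-only tokens.
_SKIP = {
    tokenize.COMMENT,
    tokenize.NL,
    tokenize.NEWLINE,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.ENCODING,
    tokenize.ENDMARKER,
}


def python_tokens(source):
    """Return the token strings of Python source, skipping comments and layout tokens."""
    reader = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(reader) if tok.type not in _SKIP]


def to_entry(identifier, source):
    """Build one input entry with the default "filename" and "tokens" fields."""
    return {"filename": identifier, "tokens": python_tokens(source)}


if __name__ == "__main__":
    print(to_entry("example.py", "x = 1  # note\n"))
```

Whether comments and whitespace should count as tokens depends on what kind of duplication matters for the corpus; this sketch ignores them so that files differing only in comments or layout yield identical token lists.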

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.