
AliOsm / AI-SOCO

License: MIT
Official FIRE 2020 Authorship Identification of SOurce COde (AI-SOCO) task repository containing dataset, evaluation tools and baselines

Projects that are alternatives of or similar to AI-SOCO

references-for-dotnet-developers
Sites, blogs, courses, social networks, and reference projects for .NET developers
Stars: ✭ 329 (+1956.25%)
Mutual labels:  source-code
Some Pentesters SecurityResearchers RedTeamers
Some pentesters, security researchers, and red teamers from whom I learned a lot...
Stars: ✭ 60 (+275%)
Mutual labels:  source-code
problem-solving
No description or website provided.
Stars: ✭ 56 (+250%)
Mutual labels:  codeforces
CP
Competitive Coding
Stars: ✭ 25 (+56.25%)
Mutual labels:  codeforces
codeforces2pdf
Light tool to extract CodeForces problems into PDF files
Stars: ✭ 18 (+12.5%)
Mutual labels:  codeforces
debug-react-source-code
Create an environment for reading and debugging the React source code, with support for breakpoint debugging of per-file builds across all React versions. Latest React version: 18.1.0.
Stars: ✭ 144 (+800%)
Mutual labels:  source-code
Codeforces-Solution
No description or website provided.
Stars: ✭ 94 (+487.5%)
Mutual labels:  codeforces
winprint
winprint 2.0 - Advanced source code and text file printing. The perfect tool for printing source code, web pages, reports generated by legacy systems, documentation, or any text or HTML file. It works interactively or from the command line, making it great for single users or whole enterprises. Works great with PowerShell.
Stars: ✭ 52 (+225%)
Mutual labels:  source-code
algo
🧠 Algorithms and Data structures
Stars: ✭ 17 (+6.25%)
Mutual labels:  codeforces
Competitive-Programming
😘 Competitive programming source code (online judges, ICPC, CCPC, Codeforces, Topcoder, Google Code Jam, etc.)
Stars: ✭ 45 (+181.25%)
Mutual labels:  codeforces
CodeCheck
✔️ An implementation of HackerRank using Django
Stars: ✭ 26 (+62.5%)
Mutual labels:  codeforces
competitive-programming
This is my collection of various algorithms and data structures that I feel are needed frequently in competitive programming.
Stars: ✭ 30 (+87.5%)
Mutual labels:  codeforces
algovault
Algorithms and templates for competitive programming
Stars: ✭ 67 (+318.75%)
Mutual labels:  codeforces
vjudge-to-oj
Import your vJudge solutions to actual online judges. Currently supports UVa, CodeForces, SPOJ, and CodeForces GYM.
Stars: ✭ 43 (+168.75%)
Mutual labels:  codeforces
Freemium-Music-App-Src
⏩ Complete Source code of Freemium Music App
Stars: ✭ 31 (+93.75%)
Mutual labels:  source-code
GpuZen2
Sample code for the article 'Real-Time Layered Materials Compositing Using Spatial Clustering Encoding'
Stars: ✭ 17 (+6.25%)
Mutual labels:  source-code
codeforces-upsolving-helper
A web app developed using Flask that compiles all the problems on Codeforces that you have attempted (submitted at least once) but could not get an Accepted verdict on. Recommended problems are also shown.
Stars: ✭ 61 (+281.25%)
Mutual labels:  codeforces
Competitive-Programming--Solution
This is a public repository of accepted solutions to coding problems on different coding platforms like Codeforces, HackerEarth, CodeChef, HackerRank...
Stars: ✭ 24 (+50%)
Mutual labels:  codeforces
A2OJ-Enhancer
Chrome extension to enhance the functionality of static A2OJ site.
Stars: ✭ 36 (+125%)
Mutual labels:  codeforces
NeuralCodeTranslator
Neural Code Translator provides instructions, datasets, and a deep learning infrastructure (based on seq2seq) that aims at learning code transformations
Stars: ✭ 32 (+100%)
Mutual labels:  source-code


AI-SOCO

Official FIRE 2020 Authorship Identification of SOurce COde (AI-SOCO) task repository containing the dataset, evaluation tools, and baselines.

10-13 December, held virtually.

You are welcome to participate in our CodaLab competition here!

All participants are welcome to open a new issue about any dataset problems!

Introduction

Authorship identification in general is essential for detecting content misuse and deception, and for exposing the owners of anonymous harmful content, since it works by revealing the author of the content in question. Authorship Identification of SOurce COde (AI-SOCO) focuses on uncovering the author who wrote a given piece of code. This facilitates solving issues related to cheating in academic, work, and open source environments. It can also help in detecting the authors of malware around the world.

Detecting cheating in academic communities is important for properly attributing each researcher's contribution. Likewise, in work environments, credit sometimes goes to people who do not deserve it, and similar plagiarism issues can arise in open source projects hosted on public platforms. Such a system could also be used in public or private online coding contests, whether in coding interviews or official training contests, to detect cheating by applicants or contestants, and it could play a big role in tracing the source of anonymous malicious software.

The dataset is composed of source codes collected from open submissions to the Codeforces online judge. Codeforces hosts competitive programming contests, where each contest consists of multiple problems to be solved by the participants. A participant can solve a problem by writing a solution in any of the programming languages available on the website and submitting it there. The solution's verdict can be correct (Accepted) or incorrect (Wrong Answer, Time Limit Exceeded, etc.).

For our dataset, we selected 1,000 users and collected 100 source codes from each, for a total of 100,000 source codes. All collected source codes are correct, bug-free, compile-ready, and written in C++ (across different language versions). For each user, all collected source codes come from unique problems.

Given the predefined set of source codes and their writers, the task is to build a system that can identify the writer of any new, previously unseen source code from the predefined writers list.

Example

Given the following bug-free and ready to compile C++ source code:

#include <string>
#include <iostream>
#include <ctype.h>
using namespace std;

int main() {
    string s;
    cin >> s;              // read a single word
    s[0] = toupper(s[0]);  // capitalize its first letter
    cout << s << endl;     // print the result
    return 0;
}

You need to build a system that can determine the source code's writer from a list of 1,000 writers.

Dataset Structure

The data_dir directory contains the following:

  • train.csv file which contains 50K pairs of uids (user IDs) and pids (problem IDs). Each uid appears 50 times in the file, with 50 different pids.
  • train directory which contains 50K files; each file, named after a different pid, holds the C++ source code that will be the input to your system.
  • dev.csv file which is similar to train.csv, but is used to evaluate your system during development, so using it in the training phase is not allowed.
  • dev directory which is similar to train, but is used to evaluate your system during development, so using it in the training phase is not allowed.
  • unlabeled_test.csv file which is similar to train.csv, but is used for the final evaluation of your system, so using it in the training phase is not allowed.
  • test directory which is similar to train, but is used for the final evaluation of your system, so using it in the training phase is not allowed.
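
To make the layout concrete, here is a minimal loading sketch in Python. It assumes the CSV columns are named uid and pid and that each file in the split directories is named after its pid; these details, and the load_split helper itself, are assumptions to verify against the actual files.

import os
import pandas as pd

DATA_DIR = "data_dir"

def load_split(split):
    """Return (uids, source_codes) for a split such as 'train' or 'dev'."""
    pairs = pd.read_csv(os.path.join(DATA_DIR, f"{split}.csv"))
    codes = []
    for pid in pairs["pid"]:  # assumed column name
        path = os.path.join(DATA_DIR, split, str(pid))
        with open(path, encoding="utf-8", errors="ignore") as f:
            codes.append(f.read())
    return pairs["uid"].tolist(), codes

train_uids, train_codes = load_split("train")
print(len(train_codes))  # expected: 50000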

Note

The data is now available on Zenodo with the test set labels.

Baseline

  • Random Baseline simply predicts a random writer for each piece of code from the list of 1,000 writers (0 to 999). Its accuracy is around 0.1%.
  • Characters Count Logistic Baseline converts each source code into a vector representing the counts of the 100 printable characters, then builds a logistic regression model on these vectorized representations. It achieves 29.252% accuracy on the development set (a rough sketch of this baseline follows this list).
  • TF-IDF KNN Baseline vectorizes the source codes using the TF-IDF method with 10K features and builds a KNN classifier with 25 neighbors on top of the TF-IDF representations. Its accuracy on the development set is 62.128%, much better than the previous baselines. Keep in mind that this baseline is very slow: predicting all examples in the development set takes about 4 hours using 6 threads.
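
The following is a minimal sketch of the Characters Count Logistic baseline using scikit-learn, not the official implementation (see baselines/ for that). Using string.printable for the 100 printable characters and reusing the hypothetical load_split helper from the loading sketch above are assumptions; the trailing comment shows how the TF-IDF KNN variant could be swapped in.

import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# load_split is the hypothetical helper from the loading sketch above.
train_uids, train_codes = load_split("train")
dev_uids, dev_codes = load_split("dev")

# Count each of the 100 printable characters in every source code.
vectorizer = CountVectorizer(
    analyzer="char",
    vocabulary=list(string.printable),  # 100 printable characters
    lowercase=False,
)
X_train = vectorizer.transform(train_codes)
X_dev = vectorizer.transform(dev_codes)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train_uids)
print("dev accuracy:", accuracy_score(dev_uids, clf.predict(X_dev)))

# TF-IDF KNN variant (sketch): fit a TfidfVectorizer(max_features=10000)
# on the training codes and classify with
# KNeighborsClassifier(n_neighbors=25, n_jobs=6) instead.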

To train and predict on the development set using any of the previously mentioned baselines, please run the following command:

python baselines/[random_baseline.py|characters_logistic_baseline.py|tfidf_knn_baseline.py]

Evaluation

Systems will be evaluated and ranked based on the accuracy metric. An evaluation script is available in the GitHub repository.
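
Accuracy here is simply the fraction of source codes attributed to the correct writer. A minimal stand-in for the official script (which remains the authoritative reference):

def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true writer uids."""
    assert len(y_true) == len(y_pred)
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Example: two of the three predictions below are correct.
print(accuracy([7, 42, 999], [7, 13, 999]))  # prints 0.666...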

Important Dates

  • 8th June – Open track website
  • 8th June – Training and development data release
  • 31st July – Test data release
  • 7th September – Run submission deadline
  • 15th September – Results declared
  • 5th October – Working notes papers due
  • 10th November – Final version of working notes papers due
  • 16th–20th December – FIRE 2020 (Online Event)

Notes

  • All scripts in this repository were tested on Ubuntu 20.04 and Python 3.8.2.

License

The dataset is distributed under the MIT license.
