yuzhimanhua / HiGitClass

License: Apache-2.0
HiGitClass: Keyword-Driven Hierarchical Classification of GitHub Repositories (ICDM'19)

Programming Languages

Python
139335 projects - #7 most used programming language
C++
36643 projects - #6 most used programming language
C
50402 projects - #5 most used programming language
Shell
77523 projects
Makefile
30231 projects

Projects that are alternatives of or similar to HiGitClass

WeSHClass
[AAAI 2019] Weakly-Supervised Hierarchical Text Classification
Stars: ✭ 83 (+43.1%)
Mutual labels:  text-classification, weakly-supervised-learning, hierarchical-classification
MetaCat
Minimally Supervised Categorization of Text with Metadata (SIGIR'20)
Stars: ✭ 52 (-10.34%)
Mutual labels:  metadata, text-classification, weakly-supervised-learning
HiLAP
Code for paper "Hierarchical Text Classification with Reinforced Label Assignment" EMNLP 2019
Stars: ✭ 116 (+100%)
Mutual labels:  text-classification, hierarchical-classification
WeSTClass
[CIKM 2018] Weakly-Supervised Neural Text Classification
Stars: ✭ 67 (+15.52%)
Mutual labels:  text-classification, weakly-supervised-learning
trove
Weakly supervised medical named entity classification
Stars: ✭ 55 (-5.17%)
Mutual labels:  text-classification, weakly-supervised-learning
JavaResolver
Java class file inspection library for .NET.
Stars: ✭ 39 (-32.76%)
Mutual labels:  metadata
textgo
Text preprocessing, representation, similarity calculation, text search and classification. Let's go and play with text!
Stars: ✭ 33 (-43.1%)
Mutual labels:  text-classification
nes
Helping researchers in routine procedures for data collection
Stars: ✭ 16 (-72.41%)
Mutual labels:  metadata
appstream-generator
A fast AppStream metadata generator
Stars: ✭ 34 (-41.38%)
Mutual labels:  metadata
ASTRA
Self-training with Weak Supervision (NAACL 2021)
Stars: ✭ 127 (+118.97%)
Mutual labels:  weakly-supervised-learning
metaschema
Schema definition and validation 💡
Stars: ✭ 25 (-56.9%)
Mutual labels:  metadata
md server
Standalone EC2 metadata server to simplify the use of vendor cloud images with standalone kvm/libvirt
Stars: ✭ 36 (-37.93%)
Mutual labels:  metadata
classification
Vietnamese Text Classification
Stars: ✭ 39 (-32.76%)
Mutual labels:  text-classification
Caver
Caver: a toolkit for multilabel text classification.
Stars: ✭ 38 (-34.48%)
Mutual labels:  text-classification
plugins
Plugins for HappyPanda X
Stars: ✭ 24 (-58.62%)
Mutual labels:  metadata
TorchBlocks
A PyTorch-based toolkit for natural language processing
Stars: ✭ 85 (+46.55%)
Mutual labels:  text-classification
Product-Categorization-NLP
Multi-Class Text Classification for products based on their description with Machine Learning algorithms and Neural Networks (MLP, CNN, Distilbert).
Stars: ✭ 30 (-48.28%)
Mutual labels:  text-classification
just-ask
[TPAMI Special Issue on ICCV 2021 Best Papers, Oral] Just Ask: Learning to Answer Questions from Millions of Narrated Videos
Stars: ✭ 57 (-1.72%)
Mutual labels:  weakly-supervised-learning
kendraio-app
Kendraio App
Stars: ✭ 19 (-67.24%)
Mutual labels:  metadata
GAL-fWSD
Generative Adversarial Learning Towards Fast Weakly Supervised Detection
Stars: ✭ 18 (-68.97%)
Mutual labels:  weakly-supervised-learning

HiGitClass: Keyword-Driven Hierarchical Classification of GitHub Repositories

This repository contains the source code for HiGitClass: Keyword-Driven Hierarchical Classification of GitHub Repositories.

Installation

For training, a GPU is strongly recommended.

Keras

The code is based on Keras. Installation instructions are available on the Keras website.

Dependencies

The code is written in Python 3.6. The dependencies are summarized in the file requirements.txt. You can install them like this:

pip3 install -r requirements.txt

Quick Start

To reproduce the results in our paper, you first need to download the datasets and the embedding files. The Machine-Learning dataset (ai/) and the Bioinformatics dataset (bio/) can be downloaded here. Unzip the archive and put the two folders under the main folder ./. Then the following script can be used to run the model.

./test.sh

Level-1/Level-2/Overall Micro-F1/Macro-F1 scores will be shown in the last several lines of the output. The classification results can be found under your dataset folder. For example, if you are using the Bioinformatics dataset, the output will be written to ./bio/out.txt.

Data

Two datasets, Machine-Learning and Bioinformatics, are used in our paper. Besides the "input" version mentioned in the Quick Start section, we also provide a JSON version, where each line is a JSON record describing one repository with its user, text (description + README), tags, repository name, and labels. An example is shown below.

{
  "repo": "Natsu6767/DCGAN-PyTorch",
  "user": "Natsu6767",
  "text": "pytorch implementation of dcgan trained on the celeba dataset deep convolutional gan ...",
  "tags": [
    "pytorch",
    "dcgan",
    "gan",
    "implementation",
    "deeplearning",
    "computer-vision",
    "generative-model"
  ],
  "name": [
    "DCGAN",
    "PyTorch"
  ],
  "labels": [
    "$Computer-Vision",
    "$Image-Generation"
  ]
}

NOTE: If you would like to run our code on your own dataset, make sure that when you prepare this json file, you list the labels in top-down order. For example, if the label path of your repository is ROOT-A-B-C, then the "labels" field should be ["A", "B", "C"].
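
If you prepare your own json file, a quick sanity check can catch formatting mistakes early. The following is a minimal sketch, not part of the released code: it assumes the records are stored one JSON object per line in a hypothetical file repos.json, and it verifies that every required field is present and that "labels" is a non-empty list.

import json

REQUIRED_FIELDS = ["user", "text", "tags", "name", "labels"]  # "repo" is optional

with open("repos.json") as f:  # hypothetical file name
    for i, line in enumerate(f, 1):
        record = json.loads(line)
        # every required field must be present
        missing = [k for k in REQUIRED_FIELDS if k not in record]
        if missing:
            raise ValueError(f"line {i}: missing fields {missing}")
        # labels must form a non-empty, top-down list
        labels = record["labels"]
        if not isinstance(labels, list) or not labels:
            raise ValueError(f"line {i}: 'labels' must be a non-empty list")
print("all records look well-formed")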

Dataset statistics are as follows.

Dataset | #Repositories | #Classes | Leaf class names
Machine-Learning | 1,596 | 3+14 | Image Generation, Object Detection, Image Classification, Semantic Segmentation, Pose Estimation, Super Resolution, Text Generation, Text Classification, Named Entity Recognition, Question Answering, Machine Translation, Language Modeling, Speech Synthesis, Speech Recognition
Bioinformatics | 876 | 2+10 | Sequence Analysis, Genome Analysis, Gene Expression, Systems Biology, Genetics and Population Analysis, Structural Bioinformatics, Phylogenetics, Text Mining, Bioimaging, Database and Ontologies

Running on New Datasets

We use ESim in the embedding module. The folders downloaded in the Quick Start section already include a pretrained embedding file. If you would like to retrain the embedding (or if you have a new dataset), please follow the steps below.

  1. Create a directory named ${dataset} under the main folder (e.g., ./bio).

  2. Prepare three files (examples of the first two are shown after these steps):
    (1) ./${dataset}/label_hier.txt, which specifies the parent-child relationships between classes. The first class on each line is the parent class, followed by all of its children classes. The root class must be named ROOT. Tab is used as the delimiter.
    (2) ./${dataset}/keywords.txt, which contains class-related keywords for each leaf class. Each line has a class name and one keyword.
    (3) ./${dataset}/${json-name}.json. You can refer to the provided json files for the format. All fields except "repo" are required.

  3. Install the dependencies GSL and Eigen. For Eigen, we already provide the zip file ESim/eigen-3.3.3.zip, which you can unzip directly in ESim/. GSL can be downloaded here.

  4. Run ./prep_emb.sh. Make sure you change the dataset/json names accordingly.
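
For illustration, the first two files for a tiny hypothetical hierarchy might look as follows. The class names and keywords below are invented for this example, following the $-prefixed label style of the provided json sample; the columns are tab-separated (rendered here as whitespace), and keywords.txt is assumed to use the same tab delimiter.

./${dataset}/label_hier.txt:
ROOT	$Computer-Vision	$Natural-Language-Processing
$Computer-Vision	$Image-Generation	$Object-Detection
$Natural-Language-Processing	$Text-Classification	$Machine-Translation

./${dataset}/keywords.txt:
$Image-Generation	gan
$Object-Detection	detection
$Text-Classification	classification
$Machine-Translation	translation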

After that, you can train the classifier as mentioned in Quick Start (i.e., ./test.sh). Please always refer to the example datasets when adapting the code for a new dataset.

Citation

If you find the implementation useful, please cite the following paper:

@inproceedings{zhang2019higitclass,
  title={HiGitClass: Keyword-Driven Hierarchical Classification of GitHub Repositories},
  author={Zhang, Yu and Xu, Frank F. and Li, Sha and Meng, Yu and Wang, Xuan and Li, Qi and Han, Jiawei},
  booktitle={ICDM'19},
  pages={876--885},
  year={2019},
  organization={IEEE}
}