
yuzhimanhua / MetaCat

License: Apache-2.0
Minimally Supervised Categorization of Text with Metadata (SIGIR'20)

Programming Languages

Python
139,335 projects - #7 most used programming language
C++
36,643 projects - #6 most used programming language

Projects that are alternatives to or similar to MetaCat

HiGitClass
HiGitClass: Keyword-Driven Hierarchical Classification of GitHub Repositories (ICDM'19)
Stars: ✭ 58 (+11.54%)
Mutual labels:  metadata, text-classification, weakly-supervised-learning
WeSTClass
[CIKM 2018] Weakly-Supervised Neural Text Classification
Stars: ✭ 67 (+28.85%)
Mutual labels:  text-classification, weakly-supervised-learning
WeSHClass
[AAAI 2019] Weakly-Supervised Hierarchical Text Classification
Stars: ✭ 83 (+59.62%)
Mutual labels:  text-classification, weakly-supervised-learning
trove
Weakly supervised medical named entity classification
Stars: ✭ 55 (+5.77%)
Mutual labels:  text-classification, weakly-supervised-learning
VideoMetadataProvider
Video metadata provider library (collect metadata from ExoPlayer, FFMpeg, Native Android)
Stars: ✭ 20 (-61.54%)
Mutual labels:  metadata
rules
One Framework to build a highly declarative and customizable UI without using templates.
Stars: ✭ 38 (-26.92%)
Mutual labels:  metadata
Valour
An open source chat client for freedom
Stars: ✭ 52 (+0%)
Mutual labels:  metadata
automatic-personality-prediction
[AAAI 2020] Modeling Personality with Attentive Networks and Contextual Embeddings
Stars: ✭ 43 (-17.31%)
Mutual labels:  text-classification
Manga-Tagger
The only tool you'll need to rename and write metadata to your digital manga library
Stars: ✭ 110 (+111.54%)
Mutual labels:  metadata
HiGRUs
Implementation of the paper "Hierarchical GRU for Utterance-level Emotion Recognition" in NAACL-2019.
Stars: ✭ 60 (+15.38%)
Mutual labels:  text-classification
nlp classification
Implementing nlp papers relevant to classification with PyTorch, gluonnlp
Stars: ✭ 224 (+330.77%)
Mutual labels:  text-classification
LLVM-Metadata-Visualizer
LLVM Metadata Visualizer
Stars: ✭ 20 (-61.54%)
Mutual labels:  metadata
dataspice
🌶️ Create lightweight schema.org descriptions of your datasets
Stars: ✭ 151 (+190.38%)
Mutual labels:  metadata
Naive-Bayes-Text-Classifier-in-Java
Naive Bayes Classification used to classify movie reviews as positive or negative
Stars: ✭ 18 (-65.38%)
Mutual labels:  text-classification
small-text
Active Learning for Text Classification in Python
Stars: ✭ 241 (+363.46%)
Mutual labels:  text-classification
Awesome-Weak-Shot-Learning
A curated list of papers, code and resources pertaining to weak-shot classification, detection, and segmentation.
Stars: ✭ 142 (+173.08%)
Mutual labels:  weakly-supervised-learning
audio-metadata
A library for reading and, in the future, writing audio metadata. https://audio-metadata.readthedocs.io/
Stars: ✭ 41 (-21.15%)
Mutual labels:  metadata
epubtool
A tool to manipulate ePub files.
Stars: ✭ 17 (-67.31%)
Mutual labels:  metadata
langauge
🎨 Stylize your readme files with colorful gauges
Stars: ✭ 16 (-69.23%)
Mutual labels:  metadata
monkeylearn-php
Official PHP client for the MonkeyLearn API. Build and consume machine learning models for language processing from your PHP apps.
Stars: ✭ 47 (-9.62%)
Mutual labels:  text-classification

Minimally Supervised Categorization of Text with Metadata

This repository contains the source code for the SIGIR 2020 paper Minimally Supervised Categorization of Text with Metadata.

Installation

For training, a GPU is highly recommended.

Keras

The code is based on the Keras library. Installation instructions are available on the Keras website.

Dependencies

The code is written in Python 3.6. The dependencies are listed in requirements.txt and can be installed with:

pip3 install -r requirements.txt

Quick Start

To reproduce the results in our paper, you first need to download the datasets. Five datasets are used in the paper. The GitHub-Sec dataset, unfortunately, cannot be released due to our commitment to the data provider; the other four datasets are available. Once you unzip the downloaded file, you will see four folders, one for each dataset.

The four datasets are summarized below (folder name, number of documents, number of classes, and class names with per-class document counts):

- GitHub-Bio, folder bio/, 876 documents, 10 classes: Sequence Analysis (210), Genome Analysis (176), Gene Expression (63), Systems Biology (53), Genetics (47), Structural Bioinformatics (39), Phylogenetics (27), Text Mining (63), Bioimaging (125), Database and Ontologies (73)
- GitHub-AI, folder ai/, 1,596 documents, 14 classes: Image Generation (215), Object Detection (296), Image Classification (361), Semantic Segmentation (170), Pose Estimation (96), Super Resolution (75), Text Generation (24), Text Classification (26), Named Entity Recognition (22), Question Answering (102), Machine Translation (117), Language Modeling (44), Speech Synthesis (27), Speech Recognition (21)
- Amazon, folder amazon/, 100,000 documents, 10 classes: Apps for Android (10,000), Books (10,000), CDs and Vinyl (10,000), Clothing, Shoes and Jewelry (10,000), Electronics (10,000), Health and Personal Care (10,000), Home and Kitchen (10,000), Movies and TV (10,000), Sports and Outdoors (10,000), Video Games (10,000)
- Twitter, folder twitter/, 135,529 documents (the count in the original paper contains a typo), 9 classes: Food (34,387), Shop and Service (13,730), Travel and Transport (8,826), College and University (2,281), Nightlife Spot (15,082), Residence (1,678), Outdoors and Recreation (19,488), Arts and Entertainment (26,274), Professional Places (13,783)

Put these four dataset folders under the repository's main folder ./. Then run the following script to train and evaluate the model:

./test.sh

Micro-F1, Macro-F1, and the confusion matrix are printed in the last few lines of the output. The classification result is written to your dataset folder; for example, for the GitHub-Bio dataset, the output is ./bio/out.txt.
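
For a concrete picture of these metrics, here is a minimal sketch of how Micro-F1, Macro-F1, and the confusion matrix can be computed with scikit-learn. The dummy labels are purely illustrative; the repository's own evaluation code may differ.

# Minimal sketch: computing the reported metrics with scikit-learn.
# y_true and y_pred are integer class ids in 0, 1, ..., N-1.
from sklearn.metrics import f1_score, confusion_matrix

def evaluate(y_true, y_pred):
    print("Micro-F1:", f1_score(y_true, y_pred, average="micro"))
    print("Macro-F1:", f1_score(y_true, y_pred, average="macro"))
    print("Confusion matrix:")
    print(confusion_matrix(y_true, y_pred))

# Example with dummy labels for a 3-class problem.
evaluate([0, 1, 2, 2, 1], [0, 1, 1, 2, 1])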

Data

Besides the "input" version mentioned in the Quick Start section, we also provide a JSON version, where each line is a JSON record containing the text and metadata (e.g., user, tags, and product).

For GitHub-Bio, GitHub-AI, and Twitter, the JSON format is as follows:

{
  "user": [
    "Natsu6767"
  ],
  "text": "pytorch implementation of dcgan trained on the celeba dataset ...",
  "tags": [
    "pytorch",
    "dcgan",
    "gan",
    "implementation",
    "deeplearning",
    "computer-vision",
    "generative-model"
  ],
  "label": 0,
  "label_name": "$Image-Generation"
}

Here, the "user" field is global metadata; the "tags" field is local metadata. (Please refer to our paper for the definitions of global and local metadata.)
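
As a quick illustration, a file in this one-record-per-line format can be read with the standard json module. This is a minimal sketch; the path bio/dataset.json follows the folder layout above, and the printed fields are just examples.

import json

# Each line of dataset.json is one JSON record with "text", "label",
# and metadata fields such as "user" (global) and "tags" (local).
docs = []
with open("bio/dataset.json") as f:
    for line in f:
        docs.append(json.loads(line))

print(docs[0]["label"], docs[0]["user"], docs[0]["text"][:50])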

For Amazon, the JSON format is as follows:

{
  "user": [
    "A1N4O8VOJZTDVB"
  ],
  "text": "really cute loves the song so he really could ...",
  "product": [
    "B004A9SDD8"
  ],
  "label": 0,
  "label_name": "Apps_for_Android"
}

Here, both "user" and "product" are global metadata; there is no local metadata in the Amazon dataset.

NOTE: If you would like to run our code on your own dataset, make sure of the following when you prepare the JSON file: (1) each document's metadata fields are always represented as lists of strings; for example, the "user" field should be ["A1N4O8VOJZTDVB"], not "A1N4O8VOJZTDVB". (2) The "label" field is an integer; if you have N classes, the label space should be 0, 1, ..., N-1. The "label_name" field is optional.
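
These constraints are easy to check programmatically. Below is a hedged sketch; the validate helper is hypothetical and not part of the repository.

import json

def validate(path, n_classes, meta_fields):
    # Check the constraints above: each metadata field is a list of
    # strings, and "label" is an integer in 0, 1, ..., n_classes - 1.
    with open(path) as f:
        for i, line in enumerate(f):
            doc = json.loads(line)
            label = doc["label"]
            assert isinstance(label, int) and 0 <= label < n_classes, \
                f"line {i}: bad label {label!r}"
            for field in meta_fields:
                value = doc[field]
                assert isinstance(value, list) and all(isinstance(v, str) for v in value), \
                    f"line {i}: field {field!r} must be a list of strings"

# Example: the Amazon dataset has 10 classes and two global metadata fields.
validate("amazon/dataset.json", n_classes=10, meta_fields=["user", "product"])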

Running on New Datasets

The folders downloaded in the Quick Start section already include a pretrained embedding file. If you have a new dataset, you need to rerun our generation-guided embedding module to produce your own embedding files. Please follow the steps below.

  1. Create a directory named ${dataset} under the main folder (e.g., ./bio).

  2. Prepare three files:

(1) ./${dataset}/doc_id.txt, which contains the labeled document ids for each class. Each line begins with the class id (starting from 0), followed by a colon and then the ids (starting from 0) of that class's documents in the corpus, separated by commas (see the sketch after these steps).

(2) ./${dataset}/dataset.json. You can refer to the provided JSON files for the format. Make sure it has the two fields "text" and "label" ("label" should be an integer in 0, 1, ..., N-1, corresponding to the classes in ./${dataset}/doc_id.txt). You can add your own metadata fields to the JSON.

(3) ./${dataset}/meta_dict.json indicating the names of your global/local metadata fields. For example, for GitHub-Bio, GitHub-AI, and Twitter, it should be

{"global": ["user"], "local": ["tags"]}

For Amazon, it should be

{"global": ["user", "product"], "local": []}

  3. Run ./prep_emb.sh. Make sure you have changed the dataset name in the script. The embedding file will be saved to ./${dataset}/embedding_gge.
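
To make the doc_id.txt format from step 2 concrete, here is a minimal sketch that writes such a file from a mapping of class ids to labeled document ids. The write_doc_id helper is hypothetical and not part of the repository.

# Hypothetical helper: write doc_id.txt, one line per class, in the
# format "class_id:doc_id,doc_id,...".
def write_doc_id(path, seeds):
    with open(path, "w") as f:
        for class_id in sorted(seeds):
            ids = ",".join(str(d) for d in seeds[class_id])
            f.write(f"{class_id}:{ids}\n")

# Example: classes 0 and 1 with a few labeled documents each.
write_doc_id("bio/doc_id.txt", {0: [3, 17, 42], 1: [5, 9]})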

With the embedding file, you can train the classifier as mentioned in Quick Start (i.e., ./test.sh). Please always refer to the example datasets when adapting the code for a new dataset.

Citation

If you find the implementation useful, please cite the following paper:

@inproceedings{zhang2020minimally,
  title={Minimally Supervised Categorization of Text with Metadata},
  author={Zhang, Yu and Meng, Yu and Huang, Jiaxin and Xu, Frank F. and Wang, Xuan and Han, Jiawei},
  booktitle={SIGIR'20},
  pages={1231--1240},
  year={2020},
  organization={ACM}
}