
WING-NUS / JD2Skills-BERT-XMLC

License: MIT
Code and Dataset for Bhola et al. (2020), "Retrieving Skills from Job Descriptions: A Language Model Based Extreme Multi-label Classification Framework".

Programming Languages

python

Projects that are alternatives of or similar to JD2Skills-BERT-XMLC

Recommendation-System-Baseline
Some common recommendation system baselines, with descriptions and links.
Stars: ✭ 34 (+3.03%)
Mutual labels:  recommendation-system
BERT-for-Chinese-Question-Answering
No description or website provided.
Stars: ✭ 75 (+127.27%)
Mutual labels:  bert
listenbrainz-labs
A collection of tools/scripts to explore the ListenBrainz data using Apache Spark.
Stars: ✭ 16 (-51.52%)
Mutual labels:  recommendation-system
classifier multi label seq2seq attention
multi-label, classifier, text classification, multi-label text classification, BERT, ALBERT, multi-label-classification, seq2seq, attention, beam search
Stars: ✭ 26 (-21.21%)
Mutual labels:  bert
Long-Tail-GAN
Adversarial learning framework to enhance long-tail recommendation in Neural Collaborative Filtering
Stars: ✭ 19 (-42.42%)
Mutual labels:  recommendation-system
bert-tensorflow-pytorch-spacy-conversion
Instructions for how to convert a BERT TensorFlow model to work with HuggingFace's pytorch-transformers and spaCy. This walk-through uses DeepPavlov's RuBERT as an example.
Stars: ✭ 26 (-21.21%)
Mutual labels:  bert
awesome-graph-self-supervised-learning-based-recommendation
A curated list of awesome graph & self-supervised-learning-based recommendation.
Stars: ✭ 37 (+12.12%)
Mutual labels:  recommendation-system
mirror-bert
[EMNLP 2021] Mirror-BERT: Converting Pretrained Language Models to universal text encoders without labels.
Stars: ✭ 56 (+69.7%)
Mutual labels:  bert
Recommendation-system
A collection of recommendation system notes and resources / Everything about Recommendation System: topics, books, papers, products, demos
Stars: ✭ 169 (+412.12%)
Mutual labels:  recommendation-system
task-transferability
Data and code for our paper "Exploring and Predicting Transferability across NLP Tasks", to appear at EMNLP 2020.
Stars: ✭ 35 (+6.06%)
Mutual labels:  bert
classifier multi label
multi-label, classifier, text classification, multi-label text classification, BERT, ALBERT, multi-label-classification
Stars: ✭ 127 (+284.85%)
Mutual labels:  bert
Tf-Rec
Tf-Rec is a python💻 package for building⚒ Recommender Systems. It is built on top of Keras and Tensorflow 2 to utilize GPU Acceleration during training.
Stars: ✭ 18 (-45.45%)
Mutual labels:  recommendation-system
berserker
Berserker - BERt chineSE woRd toKenizER
Stars: ✭ 17 (-48.48%)
Mutual labels:  bert
BERT4Rec-VAE-Pytorch
Pytorch implementation of BERT4Rec and Netflix VAE.
Stars: ✭ 212 (+542.42%)
Mutual labels:  recommendation-system
bert-squeeze
🛠️ Tools for Transformers compression using PyTorch Lightning ⚡
Stars: ✭ 56 (+69.7%)
Mutual labels:  bert
label-studio-transformers
Label data using HuggingFace's transformers and automatically get a prediction service
Stars: ✭ 117 (+254.55%)
Mutual labels:  bert
trove
Weakly supervised medical named entity classification
Stars: ✭ 55 (+66.67%)
Mutual labels:  bert
COVID-19-Tweet-Classification-using-Roberta-and-Bert-Simple-Transformers
Rank 1 / 216
Stars: ✭ 24 (-27.27%)
Mutual labels:  bert
BERT-embedding
A simple wrapper class for extracting features (embeddings) and comparing them using BERT in TensorFlow
Stars: ✭ 24 (-27.27%)
Mutual labels:  bert
rasa milktea chatbot
Chatbot with a Chinese BERT model, based on the Rasa framework (Chinese chatbot combining BERT intent analysis, built on Rasa)
Stars: ✭ 97 (+193.94%)
Mutual labels:  bert

JD2Skills-BERT-XMLC

Dataset | Paper | PPT | Presentation

Code and Dataset for Bhola et al. (2020), "Retrieving Skills from Job Descriptions: A Language Model Based Extreme Multi-label Classification Framework".

Default model weights and the dataset are available via the Dataset link above.

Dataset

The dataset is collected from the Singaporean government website mycareersfuture.sg and consists of over 20,000 richly structured job posts. The detailed statistics of the dataset are shown below:

Mycareersfuture.sg dataset statistics

  Number of job posts                          20,298
  Number of distinct skills                    2,548
  Number of skills with 20 or more mentions    1,209
  Average skill tags per job post              19.98
  Average token count per job post             162.27
  Maximum token count in a job post            1,127

This dataset includes the following fields (a minimal loading sketch follows the list):

  1. company_name
  2. job_title
  3. employment_type
  4. seniority
  5. job_category
  6. location
  7. salary
  8. min_experience
  9. skills_required
  10. requirements_and_role
  11. job_requirements
  12. company_info
  13. posting_date
  14. expiry_date
  15. no_of_applications
  16. job_id
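
The sketch below shows one way the records could be read; it assumes the dataset is distributed as a JSON list of records with the fields above, so the file name and format under pybert/dataset may differ.

    import json

    # Hypothetical file name; the files shipped under pybert/dataset may be named
    # and formatted differently (e.g. CSV or pre-split train/valid/test files).
    with open("pybert/dataset/job_dataset.json", encoding="utf-8") as f:
        job_posts = json.load(f)  # assumed: a list of dicts with the fields listed above

    # The free-text description is the model input; skills_required is the
    # multi-label target.
    post = job_posts[0]
    print(post["job_title"], "|", post["company_name"])
    print("Skill tags:", post["skills_required"])
    print("Token count:", len(post["requirements_and_role"].split()))

    # Label vocabulary (the ~2,548 distinct skills reported above).
    skills = sorted({s for p in job_posts for s in p["skills_required"]})
    print("Distinct skills:", len(skills))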

BERT-XMLC model

The proposed model consists of a pre-trained BERT-based text encoder that uses WordPiece embeddings. The encoded textual representation is passed into a bottleneck layer, which alleviates overfitting by significantly limiting the number of trainable parameters. The activations are then passed through a fully connected layer, which produces per-skill probability scores via a sigmoid activation function.
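
As an illustration only, a PyTorch sketch of this encoder-bottleneck-classifier stack is given below. It is not the repository's pybert implementation; the bottleneck width and the use of the [CLS] token for pooling are assumptions.

    import torch
    import torch.nn as nn
    from transformers import BertModel

    class BertXMLC(nn.Module):
        """BERT encoder -> bottleneck -> fully connected output with sigmoid,
        as described above. Layer sizes are illustrative, not the repo's config."""

        def __init__(self, num_skills, bottleneck_dim=256, bert_name="bert-base-uncased"):
            super().__init__()
            self.bert = BertModel.from_pretrained(bert_name)      # WordPiece-based encoder
            hidden = self.bert.config.hidden_size                 # 768 for bert-base
            self.bottleneck = nn.Linear(hidden, bottleneck_dim)   # limits trainable parameters
            self.classifier = nn.Linear(bottleneck_dim, num_skills)

        def forward(self, input_ids, attention_mask):
            out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
            pooled = out.last_hidden_state[:, 0]                  # [CLS] token representation
            z = torch.relu(self.bottleneck(pooled))
            return torch.sigmoid(self.classifier(z))              # per-skill probabilities

Training such a model would minimise a binary cross-entropy loss between these probabilities and the multi-hot skills_required vector of each job post.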

Model setup

  • Run bash setup.sh

Or

  • Transfer all files from the checkpoint folder (in Google Drive) to the pybert/pretrain/bert/bert-uncased folder
  • Transfer the dataset files from the dataset folder (in Google Drive) to the pybert/dataset folder

Training
python run_bert.py --train --data_name job_dataset

Testing
python run_bert.py --test --data_name job_dataset
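
The test run produces per-skill probability scores. The paper reports recall and normalized discounted cumulative gain; a small illustrative helper for recall@k (not part of run_bert.py) might look like this:

    import numpy as np

    def recall_at_k(scores, targets, k=5):
        """scores: (n_posts, n_skills) predicted probabilities;
        targets: (n_posts, n_skills) multi-hot ground-truth skill tags."""
        topk = np.argsort(-scores, axis=1)[:, :k]          # indices of the k highest-scoring skills
        hits = np.take_along_axis(targets, topk, axis=1)   # 1 where a top-k skill is a true tag
        per_post = hits.sum(axis=1) / np.maximum(targets.sum(axis=1), 1)
        return per_post.mean()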

Note: Configurations for training, validation and testing of the model are provided in pybert/configs/basic_config.py.
Additionally, pybert/model_setup/CAB_dataset_script.py is provided to implement Correlation Aware Bootstrapping (CAB); a rough sketch of the underlying idea follows.
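
CAB_dataset_script.py is the authoritative implementation. Purely to illustrate the co-occurrence-driven bootstrapping idea described in the paper, a sketch (with hypothetical function names and threshold) could look like this:

    from collections import Counter
    from itertools import combinations

    def cooccurrence_stats(label_sets):
        """Count single-skill and pairwise co-occurrence frequencies over all posts."""
        skill_counts, pair_counts = Counter(), Counter()
        for skills in label_sets:
            skills = set(skills)
            skill_counts.update(skills)
            pair_counts.update(frozenset(p) for p in combinations(sorted(skills), 2))
        return skill_counts, pair_counts

    def bootstrap_labels(label_sets, min_conf=0.7):
        """Add to each post the skills that co-occur with one of its existing tags
        in at least min_conf of that tag's occurrences (hypothetical threshold)."""
        skill_counts, pair_counts = cooccurrence_stats(label_sets)
        augmented = []
        for skills in label_sets:
            extra = set()
            for pair, n in pair_counts.items():
                a, b = tuple(pair)
                if a in skills and n / skill_counts[a] >= min_conf:
                    extra.add(b)
                if b in skills and n / skill_counts[b] >= min_conf:
                    extra.add(a)
            augmented.append(set(skills) | extra)
        return augmented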

Docker image deployment
To create a docker deployment:

  • First, download and set up all the dataset files and pre-trained model weights in the pybert directory
  • Run docker build . to create the Docker image
  • To start training with the default config, run docker run <docker_image_name>
  • For testing and further operations, execute commands in a /bin/bash terminal inside the container: docker run -it <docker_image_name> /bin/bash

Results

Experimental results on the skill prediction task are reported in the paper (Bhola et al., 2020).

Note: the model has been further fine-tuned.

Bibtex

@inproceedings{bhola-etal-2020-retrieving,
    title = "Retrieving Skills from Job Descriptions: A Language Model Based Extreme Multi-label Classification Framework",
    author = "Bhola, Akshay  and
      Halder, Kishaloy  and
      Prasad, Animesh  and
      Kan, Min-Yen",
    booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
    month = dec,
    year = "2020",
    address = "Barcelona, Spain (Online)",
    publisher = "International Committee on Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.coling-main.513",
    doi = "10.18653/v1/2020.coling-main.513",
    pages = "5832--5842",
    abstract = "We introduce a deep learning model to learn the set of enumerated job skills associated with a job description. In our analysis of a large-scale government job portal mycareersfuture.sg, we observe that as much as 65{\%} of job descriptions miss describing a significant number of relevant skills. Our model addresses this task from the perspective of an extreme multi-label classification (XMLC) problem, where descriptions are the evidence for the binary relevance of thousands of individual skills. Building upon the current state-of-the-art language modeling approaches such as BERT, we show our XMLC method improves on an existing baseline solution by over 9{\%} and 7{\%} absolute improvements in terms of recall and normalized discounted cumulative gain. We further show that our approach effectively addresses the missing skills problem, and helps in recovering relevant skills that were missed out in the job postings by taking into account the structured semantic representation of skills and their co-occurrences through a Correlation Aware Bootstrapping process. We further show that our approach, to ensure the BERT-XMLC model accounts for structured semantic representation of skills and their co-occurrences through a Correlation Aware Bootstrapping process, effectively addresses the missing skills problem, and helps in recovering relevant skills that were missed out in the job postings. To facilitate future research and replication of our work, we have made the dataset and the implementation of our model publicly available.",
}