
yuewang-cuhk / TAKG

License: MIT License
The official implementation of ACL 2019 paper "Topic-Aware Neural Keyphrase Generation for Social Media Language"

Projects that are alternatives of or similar to TAKG

LinkedIn Scraper
🙋 A Selenium-based automated program that scrapes profile data, stores it in CSV, follows the profiles, and saves them as PDFs.
Stars: ✭ 25 (-80.31%)
Mutual labels:  social-media, topic-modeling
Learning Social Media Analytics With R
This repository contains code and bonus content which will be added from time to time for the book "Learning Social Media Analytics with R" by Packt
Stars: ✭ 102 (-19.69%)
Mutual labels:  social-media, topic-modeling
gensimr
📝 Topic Modeling for Humans
Stars: ✭ 35 (-72.44%)
Mutual labels:  topic-modeling
Classification-of-Depression-on-Social-Media-Using-Text-Mining
A project from the first Asian machine learning camp in Jeju Island, South Korea
Stars: ✭ 57 (-55.12%)
Mutual labels:  social-media
sociallink
Alignments between knowledge bases and social media
Stars: ✭ 16 (-87.4%)
Mutual labels:  social-media
Spotify
[READ ONLY] Subtree split of the SocialiteProviders/Spotify Provider (see SocialiteProviders/Providers)
Stars: ✭ 13 (-89.76%)
Mutual labels:  social-media
Pepaverse
Pepaverse is an open-source social network built with Node.js, MongoDB, Passport.js, and Socket.IO
Stars: ✭ 16 (-87.4%)
Mutual labels:  social-media
KGE-LDA
Knowledge Graph Embedding LDA. AAAI 2017
Stars: ✭ 35 (-72.44%)
Mutual labels:  topic-modeling
foodie
A social media for food lovers and for people looking for new ideas for their next menu.
Stars: ✭ 30 (-76.38%)
Mutual labels:  social-media
v chat sdk
The official SDK for V Chat, a complete chat ecosystem: Flutter for the client, Node.js and Socket.IO for the server side. Supports Flutter chat and Flutter group chat.
Stars: ✭ 25 (-80.31%)
Mutual labels:  social-media
gobo
💭 Gobo: Your social media. Your rules.
Stars: ✭ 87 (-31.5%)
Mutual labels:  social-media
Twitter-Trends
Twitter Trends is a web-based application that automatically detects and analyzes emerging topics in real time through hashtags and user mentions in tweets. Twitter being the major microblogging service is a reliable source for trends detection. The project involved extracting live streaming tweets, processing them to find top hashtags and user …
Stars: ✭ 82 (-35.43%)
Mutual labels:  topic-modeling
tassal
Tree-based Autofolding Software Summarization Algorithm
Stars: ✭ 38 (-70.08%)
Mutual labels:  topic-modeling
Hacker
SOCIAL MEDIA PHISHING TOOL
Stars: ✭ 36 (-71.65%)
Mutual labels:  social-media
learning-stm
Learning structural topic modeling using the stm R package.
Stars: ✭ 103 (-18.9%)
Mutual labels:  topic-modeling
Flutter-Photoarc-app
(Full-stack) Fully functional social media app (Instagram clone) written in Flutter and Dart, with a Node.js and PostgreSQL backend.
Stars: ✭ 38 (-70.08%)
Mutual labels:  social-media
squeaknode
Peer-to-peer status feed 📜 with posts unlocked by Lightning ⚡
Stars: ✭ 29 (-77.17%)
Mutual labels:  social-media
Product-Categorization-NLP
Multi-Class Text Classification for products based on their description with Machine Learning algorithms and Neural Networks (MLP, CNN, Distilbert).
Stars: ✭ 30 (-76.38%)
Mutual labels:  topic-modeling
watchman
Watchman: An open-source social-media event-detection system
Stars: ✭ 18 (-85.83%)
Mutual labels:  social-media
NMFADMM
A sparsity aware implementation of "Alternating Direction Method of Multipliers for Non-Negative Matrix Factorization with the Beta-Divergence" (ICASSP 2014).
Stars: ✭ 39 (-69.29%)
Mutual labels:  topic-modeling

TAKG

The official implementation of the ACL 2019 paper "Topic-Aware Neural Keyphrase Generation for Social Media Language" (TAKG). This is joint work with the NLP Center at Tencent AI Lab. Some scripts for drawing the figures in the paper can be found here.

Dataset

Due to the copyright issue of the TREC 2011 Twitter dataset, we only release the Weibo dataset (in data/Weibo) and the StackExchange dataset (in data/StackExchange). For more details about the Twitter dataset, please contact Yue Wang or Jing Li.

Data format

  • The dataset is randomly split into three segments: 80% training, 10% validation, and 10% testing.
  • For each segment (train/valid/test), we provide the source posts (stored in "*_src.txt") and the target keyphrases (stored in "*_trg.txt"). Each line corresponds to one instance.
  • When a post has multiple keyphrases, they are separated by a semicolon ";" (see the loading sketch below).
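
For illustration, here is a minimal loading sketch in Python. load_split is a hypothetical helper (not part of this repo), and the train_src.txt/train_trg.txt file names are assumed from the naming pattern above:

```python
# Hypothetical helper: pairs each source post with its list of keyphrases.
def load_split(src_path, trg_path):
    with open(src_path, encoding="utf-8") as f_src, open(trg_path, encoding="utf-8") as f_trg:
        for post, trg_line in zip(f_src, f_trg):
            # Multiple keyphrases per post are separated by ";"
            keyphrases = [kp.strip() for kp in trg_line.strip().split(";") if kp.strip()]
            yield post.strip(), keyphrases

# Example: read the first instance of the Weibo training split.
post, keyphrases = next(load_split("data/Weibo/train_src.txt", "data/Weibo/train_trg.txt"))
```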

Data statistics

We first show the statistics of the source posts. (KP refers to keyphrase)

Datasets        # of posts   Avg len of posts   # of KP per post   Source vocabulary
Twitter         44,113       19.52              1.13               34,010
Weibo           46,296       33.07              1.06               98,310
StackExchange   49,447       87.94              2.43               99,775

We then show the detailed statistics of keyphrases:

Datasets        # of distinct KP   Avg len of KP   % of absent KP   Target vocabulary
Twitter         4,347              1.92            71.35            4,171
Weibo           2,136              2.55            75.74            2,833
StackExchange   12,114             1.41            54.32            10,852

Model

Our model allows joint modeling of latent topics and keyphrase generation. It consists of a neural topic model to induce the latent topics and a neural seq2seq-based generation model to produce keyphrases. The overall architecture is depicted below:

[Figure: the overall architecture of TAKG]
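
For intuition, below is a minimal, simplified sketch of the two components in PyTorch. The class names, layer sizes, and the exact way the topic vector enters the attention context are illustrative assumptions, not the repo's actual code:

```python
import torch
import torch.nn as nn

class NeuralTopicModel(nn.Module):
    """Sketch of a VAE-style topic model over a post's bag-of-words (BoW)."""
    def __init__(self, bow_vocab=10000, topic_num=50, hidden=500):  # sizes are illustrative
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(bow_vocab, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, topic_num)
        self.logvar = nn.Linear(hidden, topic_num)
        self.fc_bow = nn.Linear(topic_num, bow_vocab)  # reconstructs the BoW input

    def forward(self, bow):
        h = self.encoder(bow)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        g = torch.softmax(z, dim=-1)  # normalized topic mixture (cf. the topic_type z/g option)
        return z, g, self.fc_bow(g), mu, logvar

class TopicAwareContext(nn.Module):
    """Sketch of '-topic_attn': fold the topic vector into the attention context."""
    def __init__(self, hidden=300, topic_num=50):
        super().__init__()
        self.proj = nn.Linear(hidden + topic_num, hidden)

    def forward(self, context, topic):
        # context: [batch, hidden] attention context from the seq2seq decoder
        # topic:   [batch, topic_num] topic representation from the NTM
        return torch.tanh(self.proj(torch.cat([context, topic], dim=-1)))
```

During joint training, the NTM's reconstruction/KL loss and the seq2seq generation loss are combined or alternated according to the joint-training options described under Training below.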

Code

Here we give some representative commands illustrating how to preprocess the data and how to train, test, and evaluate our model. For more detailed configurations, please refer to config.py. In train.py and predict.py, I hard-code some default arguments in the function process_opt(opt) to simplify each running command. We also provide our model's sample predictions for the three datasets in my_sample_prediction.

This code is mainly adapted from Ken Chan's keyphrase generation code and Zengjichuan's TMN. Many thanks to my colleagues (Ken and Zeng) for their support.

Dependencies

  • Python 3.5+
  • PyTorch 0.4

Preprocessing

To preprocess the source data, run: python preprocess.py -data_dir data/Weibo

It will output the processed data to the folder processed_data. The filename of the processed data (e.g., Weibo_s100_t10) is then used as the data_tag argument in the following steps.

Some common arguments are listed below; an example command combining them follows the list:

data_dir: Path to the directory of the source data
vocab_size: Size of the source vocabulary
bow_vocab: Size of the bag-of-words (BoW) dictionary
max_src_len: Maximum length of the source sequence
max_trg_len: Maximum length of the target sequence
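
For example, a fully spelled-out preprocessing command might look like the one below. The values are illustrative only (check config.py for the actual defaults), although the data_tag Weibo_s100_t10 used in the training commands suggests max_src_len=100 and max_trg_len=10:

```
python preprocess.py -data_dir data/Weibo -vocab_size 50000 -bow_vocab 10000 \
    -max_src_len 100 -max_trg_len 10
```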

Training

To train only the seq2seq model, run: python train.py -data_tag Weibo_s100_t10. To train with the copy mechanism, add -copy_attention.

To train only the neural topic model, run: python train.py -data_tag Weibo_s100_t10 -only_train_ntm -ntm_warm_up_epochs [epoch_num, e.g., 100].

To jointly train the TAKG model, run: python train.py -data_tag Weibo_s100_t10 -copy_attention -use_topic_represent -load_pretrain_ntm -joint_train -topic_attn -check_pt_ntm_model_path [the warmed up ntm model path].
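
Putting the warm-up and joint-training steps together (the bracketed checkpoint path is a placeholder; use the NTM checkpoint saved by the warm-up run):

```
# 1) Warm up the neural topic model alone, e.g., for 100 epochs
python train.py -data_tag Weibo_s100_t10 -only_train_ntm -ntm_warm_up_epochs 100

# 2) Jointly train TAKG, loading the warmed-up NTM checkpoint from step 1
python train.py -data_tag Weibo_s100_t10 -copy_attention -use_topic_represent \
    -load_pretrain_ntm -joint_train -topic_attn \
    -check_pt_ntm_model_path [the warmed up ntm model path]
```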

There are some common arguments controlling the joint-training strategies and model variants:

joint_train: whether to jointly train the model (True or False)
joint_train_strategy: how to jointly train the model; choices = ['p_1_iterate', 'p_0_iterate', 'p_1_joint', 'p_0_joint'], where "iterate" means iteratively training each module, "joint" means jointly training both modules, and the digit (0/1) is the number of epochs for pretraining the seq2seq model
topic_type: use latent variable z or g as the topic (g is the default)
topic_dec: add the topic to the decoder input
topic_attn: add the topic to the context vector
topic_copy: add the topic to the copy switch
topic_attn_in: add the topic when computing attention scores
add_two_loss: use the sum of the two losses as the training objective

Different ways of adding the topic can be combined, e.g., -topic_attn -topic_dec -topic_copy.
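
For instance, a joint-training run that injects the topic into the decoder input, the context vector, and the copy switch at once (paths in brackets are placeholders, as above):

```
python train.py -data_tag Weibo_s100_t10 -copy_attention -use_topic_represent \
    -load_pretrain_ntm -joint_train -topic_dec -topic_attn -topic_copy \
    -check_pt_ntm_model_path [the warmed up ntm model path]
```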

Inference

To generate predictions, run: python predict.py -model [seq2seq model path] (optionally with -ntm_model [ntm model path]).

Some common arguments are listed below; a full example command follows the list:

batch_size: batch size during prediction; decrease this number if it exceeds GPU memory
beam_size: beam size; decrease this number if it exceeds GPU memory
n_best: pick the top n_best sequences from beam search (default -1; if n_best < 0, then n_best = beam_size)
max_length: maximum sequence length for each prediction
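
A full example command; the bracketed paths are placeholders and the numeric values are illustrative, not the defaults:

```
python predict.py -model [seq2seq model path] -ntm_model [ntm model path] \
    -batch_size 32 -beam_size 10 -n_best 5 -max_length 6
```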

It will output the predictions under the folder pred. For the format of the predictions, please refer to my_sample_prediction.

Evaluation

To evaluate the predictions, run: python pred_evaluate.py -pred [prediction path] -src data/Weibo/test_src.txt -trg data/Weibo/test_trg.txt

It will report the performance in different metrics, including Precision, Recall, F1 measure, MAP, and NDCG at different numbers of top predictions. You can modify these cutoffs in the main function of pred_evaluate.py. It also reports results separately for present keyphrases, absent keyphrases, and both combined.
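
For reference, here is a minimal sketch of how Precision/Recall/F1 at top-K can be computed with exact string matching. It only illustrates the metric; pred_evaluate.py is the authoritative implementation (it additionally reports MAP and NDCG and splits results into present/absent keyphrases):

```python
def f1_at_k(predicted, gold, k=5):
    """Precision/Recall/F1 over the top-k predicted keyphrases (exact match)."""
    topk = predicted[:k]
    hits = sum(1 for kp in topk if kp in set(gold))
    precision = hits / len(topk) if topk else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: 2 of the top-5 predictions match the 3 gold keyphrases.
print(f1_at_k(["a", "b", "c", "d", "e"], ["a", "c", "x"], k=5))  # -> (0.4, ~0.667, 0.5)
```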

Citation

If you use either the code or data in your paper, please kindly star this repo and cite our paper:

@inproceedings{wang-etal-2019-topic-aware,
    title = "Topic-Aware Neural Keyphrase Generation for Social Media Language",
    author = "Wang, Yue  and
      Li, Jing  and
      Chan, Hou Pong  and
      King, Irwin  and
      Lyu, Michael R.  and
      Shi, Shuming",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1240",
    pages = "2516--2526",
    abstract = "A huge volume of user-generated content is daily produced on social media. To facilitate automatic language understanding, we study keyphrase prediction, distilling salient information from massive posts. While most existing methods extract words from source posts to form keyphrases, we propose a sequence-to-sequence (seq2seq) based neural keyphrase generation framework, enabling absent keyphrases to be created. Moreover, our model, being topic-aware, allows joint modeling of corpus-level latent topic representations, which helps alleviate data sparsity widely exhibited in social media language. Experiments on three datasets collected from English and Chinese social media platforms show that our model significantly outperforms both extraction and generation models without exploiting latent topics. Further discussions show that our model learns meaningful topics, which interprets its superiority in social media keyphrase generation.",
}