
gau820827 / Ai Writer_data2doc

License: apache-2.0
PyTorch Implementation of NBA game summary generator.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Ai Writer data2doc

Tf Seq2seq
Sequence to sequence learning using TensorFlow.
Stars: ✭ 387 (+460.87%)
Mutual labels:  natural-language-processing, seq2seq
Question generation
It is a question-generator model. It takes text and an answer as input and outputs a question.
Stars: ✭ 166 (+140.58%)
Mutual labels:  natural-language-processing, seq2seq
Transformers
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Stars: ✭ 55,742 (+80685.51%)
Mutual labels:  natural-language-processing, seq2seq
Pytorch Beam Search Decoding
PyTorch implementation of beam search decoding for seq2seq models
Stars: ✭ 204 (+195.65%)
Mutual labels:  natural-language-processing, seq2seq
Trade Dst
Source code for transferable dialogue state generator (TRADE, Wu et al., 2019). https://arxiv.org/abs/1905.08743
Stars: ✭ 287 (+315.94%)
Mutual labels:  natural-language-processing, seq2seq
Text summarization with tensorflow
Implementation of a seq2seq model for summarization of textual data. Demonstrated on amazon reviews, github issues and news articles.
Stars: ✭ 226 (+227.54%)
Mutual labels:  natural-language-processing, seq2seq
Spark Nlp
State of the Art Natural Language Processing
Stars: ✭ 2,518 (+3549.28%)
Mutual labels:  natural-language-processing, seq2seq
Multiwoz
Source code for end-to-end dialogue model from the MultiWOZ paper (Budzianowski et al. 2018, EMNLP)
Stars: ✭ 384 (+456.52%)
Mutual labels:  natural-language-processing, seq2seq
Practical Pytorch
Go to https://github.com/pytorch/tutorials - this repo is deprecated and no longer maintained
Stars: ✭ 4,329 (+6173.91%)
Mutual labels:  natural-language-processing, seq2seq
How To Mine Newsfeed Data And Extract Interactive Insights In Python
A practical guide to topic mining and interactive visualizations
Stars: ✭ 61 (-11.59%)
Mutual labels:  natural-language-processing
Multilingual Latent Dirichlet Allocation Lda
A Multilingual Latent Dirichlet Allocation (LDA) Pipeline with Stop Words Removal, n-gram features, and Inverse Stemming, in Python.
Stars: ✭ 64 (-7.25%)
Mutual labels:  natural-language-processing
Language Models
Build unigram and bigram language models, implement Laplace smoothing and use the models to compute the perplexity of test corpora.
Stars: ✭ 59 (-14.49%)
Mutual labels:  natural-language-processing
Slate
A Super-Lightweight Annotation Tool for Experts: Label text in a terminal with just Python
Stars: ✭ 61 (-11.59%)
Mutual labels:  natural-language-processing
Kor2vec
Library for Korean morpheme and word vector representation
Stars: ✭ 64 (-7.25%)
Mutual labels:  natural-language-processing
Fromscratch
Stars: ✭ 61 (-11.59%)
Mutual labels:  natural-language-processing
Capsnet Nlp
CapsNet for NLP
Stars: ✭ 66 (-4.35%)
Mutual labels:  natural-language-processing
Textblob Ar
Arabic support for textblob
Stars: ✭ 60 (-13.04%)
Mutual labels:  natural-language-processing
Bidaf Keras
Bidirectional Attention Flow for Machine Comprehension implemented in Keras 2
Stars: ✭ 60 (-13.04%)
Mutual labels:  natural-language-processing
Hackerrank
This is the Repository where you can find all the solution of the Problems which you solve on competitive platforms mainly HackerRank and HackerEarth
Stars: ✭ 68 (-1.45%)
Mutual labels:  natural-language-processing
Text Analytics With Python
Learn how to process, classify, cluster, summarize, understand syntax, semantics and sentiment of text data with the power of Python! This repository contains code and datasets used in my book, "Text Analytics with Python" published by Apress/Springer.
Stars: ✭ 1,132 (+1540.58%)
Mutual labels:  natural-language-processing

AI Writer - Data2Doc

This project automatically generates game summaries from NBA box-scores.

Requirements

  1. Python 3.6
  2. PyTorch 0.2

Data set

We use the Rotowire dataset for training, as in Challenges in Data-to-Document Generation (Wiseman, Shieber, Rush; EMNLP 2017). This dataset consists of (human-written) NBA basketball game summaries aligned with their corresponding box- and line-scores.
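For orientation, the snippet below is a minimal sketch of inspecting one game record after extracting the archive (see Basic Usage below). The key names follow the JSON layout of the Wiseman et al. release; the path is an assumption and may differ on your machine.

# Minimal sketch: peek at one Rotowire game record.
# Assumes the JSON layout from the Wiseman et al. release; adjust the path
# to wherever the archive was extracted.
import json

with open('rotowire/train.json') as f:
    games = json.load(f)                              # a list of game records

game = games[0]
print(game['home_name'], 'vs', game['vis_name'])      # team names
print(len(game['summary']), 'tokens in the gold summary')
print(sorted(game['box_score'].keys())[:5])           # a few box-score stat columns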

Basic Usage

  1. First, extract the dataset using tar -jxvf boxscore-data/rotowire.tar.bz2.
  2. Go to the train/ directory, using cd train.
  3. (Optional) Train the model, using python3 train.py.
  4. Generate some text, using python3 small_evaluate.py.

Some pre-trained model files can be found here. Extract them under the model/ directory.
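As a quick, hedged example of restoring a checkpoint: the snippet below assumes the files follow the [pretrain]_[encoder|decoder|optim]_[iter_num] naming convention described under Arguments and were written with torch.save. The file names are illustrative, and whether each file holds a full module or a state_dict depends on train.py, so inspect the loaded objects first.

# Minimal sketch: load a pre-trained encoder/decoder pair from model/.
# File names are illustrative; what torch.load returns (full module vs.
# state_dict) depends on how train.py saved it.
import torch

encoder_ckpt = torch.load('model/hbilstm_encoder_200', map_location='cpu')
decoder_ckpt = torch.load('model/hbilstm_decoder_200', map_location='cpu')
print(type(encoder_ckpt), type(decoder_ckpt))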

  • Sample generated summary using the BiLSTM encoder
The Chicago Bulls defeated the Orlando Magic , 102 - 92 , at
United Center on Saturday . The Bulls ( 10 - 2 ) were expected
to win this game rather easily and they did n’t disappoint in the
fourth quarter . In fact , they outscored the Magic by 12 points in
the fourth quarter to pull out the win . Defense was key for the
Bulls , as they held the Magic to 37 percent from the field and 25
percent from three - point range . The Magic also dominated the
rebounding , winning that battle , 44 - 32 . The Magic ( 8 - 18 )
had to play this game without Joel Embiid and Nikola Vucevic ,
as they simply did n’t have enough to get out of hand . Jimmy
Butler was the player of the game , as he tallied 23 points , five
rebounds and four assists . Aaron Gordon was the only other
starter with more than 10 points , as he totaled 15 points , five
rebounds and four assists. . . ...

Arguments

The script train/train.py accepts the following arguments.

(The default configurations for each argument can be found in train/settings.py.)

  # Parameters for model
  -embed EMBEDDING_SIZE,    the hidden size for embeddings, default = 600
  -lr LR,                   initial learning rate, default = 0.01
  -batch BATCH_SIZE,        batch size, default = 2
  
  # Parameters for architecture
  -encoder ENCODER_STYLE,   type of encoder NN (LIN, BiLSTM, RNN, BiLSTMMax, HierarchicalRNN,
                            HierarchicalBiLSTM, HierarchicalLIN); the two-level idea behind the
                            Hierarchical* styles is sketched after the example commands below
  -decoder DECODER_STYLE,   type of decoder NN (RNN, HierarchicalRNN)
  -copy,                    whether to apply a pointer-generator network (True, False), default = False;
                            see the sketch after this argument list
 
  # Parameters for training
  -gradclip GRAD_CLIP,      gradient clipping, default = 2
  -pretrain PRETRAIN,       file name of the pretrained model (must be given together with -iternum)
  -iternum ITER_NUM,        iteration number of the pretrained model (must be given together with -pretrain)
  -layer LAYER_DEPTH,       the depth of recurrent units, default = 2; no depth for linear units
  -copyplayer COPY_PLAYER,  whether to include players' information in the data, default = False
  -epoch EPOCH_TIME,        maximum number of epochs for training
  -maxlength MAX_LENGTH,    maximum number of words per sentence
  -maxsentence MAX_SENTENCE, limit on the number of sentences used for training; set it lower (e.g. 5) for faster training. If not specified, the program trains on the entire corpus.

  # Parameters for display
  -getloss GET_LOSS,        print out average loss every `GET_LOSS` iterations.
  -epochsave SAVE_MODEL,    save the model every `SAVE_MODEL` epochs.
  -outputfile OUTPUT_FILE,  prefix used when saving the model. During training, the encoder and decoder are saved as `[OUTPUT_FILE]_[encoder|decoder]_[iter_time]` every `SAVE_MODEL` epochs.
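Since -copy toggles a pointer-generator network, here is a minimal, self-contained sketch of the standard pointer-generator mixture (in the spirit of See et al., 2017). It blends the decoder's vocabulary distribution with a copy distribution induced by attention over the source records; all names here are illustrative, and the repo's actual implementation lives in train/model.py.

import torch
import torch.nn.functional as F

def pointer_generator_dist(vocab_logits, attn_weights, src_ids, p_gen):
    # vocab_logits: (B, V) decoder scores; attn_weights: (B, S) attention
    # over source records; src_ids: (B, S) vocab ids of the source tokens;
    # p_gen: (B, 1) probability of generating rather than copying.
    gen_dist = p_gen * F.softmax(vocab_logits, dim=-1)
    copy_dist = torch.zeros_like(gen_dist)
    # scatter the copy probability mass onto the vocab ids of source tokens
    copy_dist.scatter_add_(1, src_ids, (1.0 - p_gen) * attn_weights)
    return gen_dist + copy_dist

# toy check: every row of the mixed distribution sums to 1
B, V, S = 2, 10, 4
out = pointer_generator_dist(torch.randn(B, V),
                             F.softmax(torch.randn(B, S), dim=-1),
                             torch.randint(0, V, (B, S)),
                             torch.sigmoid(torch.randn(B, 1)))
print(out.sum(dim=-1))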

With the above arguments, a variety of configurations can be trained:

python train.py # This will train using default settings in `train/settings.py`

python train.py -embed 300 -lr 0.01 -batch 3 -getloss 20 -encoder HierarchicalRNN 
                -decoder HierarchicalRNN -epochsave 12 -copy True -copyplayer False 
                -gradclip 2 -layer 2 -epoch 3 -outputfile pretrain_copy 
                -pretrain hbilstm -iternum 200
                # -pretrain and -iternum must be specified together
                # the corresponding pretrained model name will be in the format:
                #    [pretrain]_[encoder|decoder|optim]_[iter_num]

python train.py -embed 720 -lr 0.02 -batch 3 -getloss 10 -encoder HierarchicalLIN 
                -decoder HierarchicalRNN -epochsave 5 -copy True -copyplayer True 
                -gradclip 3 -maxsentence 800  -epoch 3

python train.py -embed 512 -lr 0.03 -batch 3 -getloss 10 -encoder BiLSTM 
                -decoder RNN -epochsave 12 -copy True -copyplayer True 
                -gradclip 3 -maxsentence 230 -layer 2 -epoch 3

python train.py -embed 600 -lr 0.02 -batch 3 -getloss 10 -encoder LIN 
                -decoder RNN -epochsave 12 -copy True -copyplayer False 
                -gradclip 5 -maxsentence 800 -layer 2 -epoch 3
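The Hierarchical* encoder styles used above encode the box-score at two levels: a lower RNN summarizes the records of one player/team row, and an upper RNN runs over the per-row summaries. Below is a minimal sketch of that two-level idea; it assumes records arrive pre-embedded and pre-grouped into rows, and the real classes in train/model.py differ in detail.

import torch
import torch.nn as nn

class TwoLevelEncoder(nn.Module):
    # Illustrative two-level encoder: record-level LSTM, then row-level LSTM.
    def __init__(self, embed_dim, hidden_dim):
        super().__init__()
        self.record_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.row_rnn = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, rows):
        # rows: (batch, n_rows, n_records, embed_dim), pre-embedded records
        b, r, n, e = rows.shape
        _, (h, _) = self.record_rnn(rows.view(b * r, n, e))
        row_vecs = h[-1].view(b, r, -1)          # one vector per row
        row_states, _ = self.row_rnn(row_vecs)   # contextualized row states
        return row_states

enc = TwoLevelEncoder(600, 600)
print(enc(torch.randn(2, 5, 8, 600)).shape)      # torch.Size([2, 5, 600])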

More Details

Train

The train/ directory contains the data-to-text generation code. The files for this part include:

  • dataprepare.py -- the word2index map class, storing the vocabulary and relations
  • model.py -- the implementation of the encoder, decoder, and embedding model classes
  • preprocessing.py -- mainly for reading and parsing the data
  • train.py -- the file that defines the training process
  • util.py -- utility functions for timing, display, etc.
  • settings.py -- stores the hyper-parameters, file locations, etc.

Evaluate

The evaluate/ directory contains the extractive evaluation system, based on Challenges in Data-to-Document Generation (Wiseman, Shieber, Rush; EMNLP 2017) and parts of their code. The files from their repo include:

  • data_utils.py -- for the data parsing and cleaning
  • extractor.lua -- the main script for the relation classifier
  • non_rg_metrics.py -- content selection (CS) and content ordering (CO) computation; a sketch of CO follows below
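For reference, content ordering in Wiseman et al. is a normalized Damerau-Levenshtein distance between the record sequences extracted from the generated and gold summaries. The sketch below implements the optimal-string-alignment variant, normalized by the longer sequence; it is illustrative only, and non_rg_metrics.py remains the authoritative implementation.

def normalized_dld(a, b):
    # Optimal-string-alignment variant of Damerau-Levenshtein distance,
    # normalized by the longer sequence length.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n] / max(m, n, 1)

print(normalized_dld(['PTS', 'AST', 'REB'], ['PTS', 'REB', 'AST']))  # ~0.333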

Thanks to Wiseman et al. for the dataset and code.
