jeffchy / RE2RNN

Licence: other
Source code for the EMNLP 2020 paper "Cold-Start and Interpretability: Turning Regular Expressions into Trainable Recurrent Neural Networks"

RE2RNN

Source code for the EMNLP 2020 paper: "Cold-start and Interpretability: Turning Regular Expressions into Trainable Recurrent Neural Networks", by Chengyue Jiang, Yinggong Zhao, Shanbo Chu, Libin Shen, and Kewei Tu.

Citation

@inproceedings{jiang-etal-2020-cold,
    title = "Cold-start and Interpretability: Turning Regular Expressions into Trainable Recurrent Neural Networks",
    author = "Jiang, Chengyue  and
      Zhao, Yinggong  and
      Chu, Shanbo  and
      Shen, Libin  and
      Tu, Kewei",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.258",
    pages = "3193--3207",
}

Requirements

  • pytorch 1.3.1
  • tensorly 0.5.0
  • numpy
  • tqdm
  • automata-tools
  • pyparsing
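
To check which of the packages above are present in the current environment, a small sketch like the following may help. The distribution names are assumptions based on the usual PyPI names (for example, PyTorch is normally installed as torch); adjust them if your environment differs.

```python
from importlib import metadata

# Assumed PyPI distribution names for the requirements listed above.
REQUIRED = ["torch", "tensorly", "numpy", "tqdm", "automata-tools", "pyparsing"]

def installed_versions(names=REQUIRED):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```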

Data

The raw dataset files, preprocessed dataset files, GloVe word embedding matrix, rules for each dataset, and decomposed automata files can be downloaded here: Google Drive, Tencent Drive.

You can download and extract the zip file, and replace the original data directory. The directory structure should be:

.
├── data
│   ├── ATIS
│   │   ├── automata
│   │   ├── ....
│   ├── TREC
│   │   ├── automata
│   │   ├── ....
│   ├── ....
├── src
│   ├── ....
├── src_simple
│   ├── ....
├── model
│   ├── ....
├── imgs
│   ├── ....
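
After extracting the archive, the layout can be sanity-checked with a short script. This is only a sketch: it checks the directories that are spelled out in the tree above, ignores the entries shown as "....", and the function name missing_dirs is made up for illustration.

```python
from pathlib import Path

# Directories shown explicitly in the tree above; the "...." entries are not checked.
EXPECTED_DIRS = [
    "data/ATIS/automata",
    "data/TREC/automata",
    "src",
    "src_simple",
    "model",
    "imgs",
]

def missing_dirs(root="."):
    """Return the expected directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]

if __name__ == "__main__":
    missing = missing_dirs(".")
    print("All expected directories found." if not missing else f"Missing: {missing}")
```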

If you have done this, you can skip to the training section.

Regular Expressions

We provide the RE rules for three datasets: ATIS, QC (TREC-6), and SMS. Our REs are word-level, not character-level. The symbols and their meanings are shown in the following table.

Symbol    Meaning
$         wildcard (any word)
%         numbers, e.g. 5, 1996
&         punctuation
?         zero or one occurrence
*         zero or more occurrences
+         one or more occurrences
(a|b)     a or b

Regular Expression Examples

ATIS - abbreviation label.

[abbreviation]
( $ * ( mean | ( stand | stands ) for | code ) $ * ) | ( $ * what is $ EOS )

SMS - spam label.

[spam]
$ * dating & * $ * call $*
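
To get a feel for what a word-level rule matches, one option is to translate it into an ordinary Python regular expression over a space-joined token string. The sketch below is an illustration only and is not how this repository evaluates rules (it converts them to automata). It assumes that every symbol in the rule is separated by whitespace, that a literal EOS token is appended to each sentence, and that numbers and punctuation are single tokens; the function names are hypothetical.

```python
import re

def word_re_to_pattern(rule):
    """Translate a space-separated word-level RE into a compiled Python regex.

    Each word-level unit becomes a group that consumes one token plus a space.
    """
    parts = []
    for sym in rule.split():
        if sym == "(":
            parts.append("(?:")
        elif sym in {")", "|", "?", "*", "+"}:
            parts.append(sym)                 # operators keep their usual meaning
        elif sym == "$":
            parts.append(r"(?:\S+ )")         # wildcard: any single token
        elif sym == "%":
            parts.append(r"(?:\d+ )")         # number token
        elif sym == "&":
            parts.append(r"(?:[^\w\s]+ )")    # punctuation token
        else:
            parts.append("(?:" + re.escape(sym) + " )")  # literal word
    return re.compile("".join(parts))

def matches(pattern, tokens):
    """Check whether the whole token list (plus an assumed EOS token) matches."""
    text = " ".join(list(tokens) + ["EOS"]) + " "
    return pattern.fullmatch(text) is not None

ABBREVIATION_RULE = (
    "( $ * ( mean | ( stand | stands ) for | code ) $ * ) | ( $ * what is $ EOS )"
)
pattern = word_re_to_pattern(ABBREVIATION_RULE)
print(matches(pattern, "what does ap stand for".split()))
print(matches(pattern, "show me flights to boston".split()))
```

Appending a space after every token keeps word boundaries unambiguous, so "stand" cannot accidentally match the start of "stands".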

Regular Expression to FA

We show examples on the ATIS dataset. For other datasets, simply change the --dataset option to TREC or SMS.

Prepare the dataset

First download the GloVe 6B embeddings and place the embedding files in data/emb/glove.6B/. You can also prepare the dataset from the raw dataset files by running the following command.

python data.py --dataset ATIS
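
GloVe 6B text files store one word per line, followed by its vector components separated by spaces. A minimal sketch for reading such a file into a dictionary is shown below, for example to confirm that the files were placed correctly. The loader is not part of this repository, and the file name in the usage comment is an assumption, since the embedding dimension expected by data.py is not stated here.

```python
def load_glove(path, limit=None):
    """Read a GloVe-style text file into {word: [float, ...]}.

    limit optionally caps the number of lines read, which is handy for a quick check.
    """
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f):
            if limit is not None and n >= limit:
                break
            word, *values = line.rstrip().split(" ")
            vectors[word] = [float(v) for v in values]
    return vectors

# Usage (file name is an assumption):
# vectors = load_glove("data/emb/glove.6B/glove.6B.100d.txt", limit=1000)
# print(len(vectors), len(next(iter(vectors.values()))))
```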

RE to FA

We turn the regular expressions into finite automata using the automata-tools package, implemented by @linonetwo. This tool is a modified version of https://github.com/sdht0/automata-from-regex. The package requires the 'dot' command for drawing the automata.

To convert the REs, or the reversed REs (for the backward direction), to FAs, run the following commands.

python create_automata.py --dataset ATIS --automata_name all --reversed 0
python create_automata.py --dataset ATIS --automata_name all --reversed 1
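
For the backward direction the sentence is read right to left, so the automaton has to accept the reversed word sequence. One common way to picture this is to flip every transition and swap the start and final states. The sketch below uses a hypothetical set-based NFA representation (not the automata-tools data structures) and is not necessarily what create_automata.py does internally.

```python
def reverse_nfa(nfa):
    """Flip all transitions and swap the start and final state sets."""
    return {
        "start": set(nfa["final"]),
        "final": set(nfa["start"]),
        "edges": {(dst, sym, src) for (src, sym, dst) in nfa["edges"]},
    }

def accepts(nfa, tokens):
    """Simulate the NFA on a token list; '$' on an edge matches any token."""
    current = set(nfa["start"])
    for tok in tokens:
        current = {dst for (src, sym, dst) in nfa["edges"]
                   if src in current and sym in (tok, "$")}
    return bool(current & nfa["final"])

# Toy automaton for the word-level RE "$ * stand for".
forward = {
    "start": {0},
    "final": {2},
    "edges": {(0, "$", 0), (0, "stand", 1), (1, "for", 2)},
}
backward = reverse_nfa(forward)

sentence = "what does ap stand for".split()
print(accepts(forward, sentence))          # forward automaton, left to right
print(accepts(backward, sentence[::-1]))   # reversed automaton, right to left
```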

The regular expression for the ATIS abbreviation label shown above can be represented by the following automaton.

[Figure: automaton for the ATIS abbreviation regular expression]

Run the REs

The RE system's results are obtained by running the un-decomposed automaton you just created.

python main.py --model_type Onehot --dataset ATIS --only_probe 1 --wfa_type viterbi \
--epoch 0 --automata_path_forward all.1105155534-1604591734.6171093.split.pkl --seed 0
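
The --wfa_type option takes values such as viterbi and forward in the commands shown here. In standard weighted finite automaton (WFA) terminology, the forward score sums the weights of all paths that read the sentence, while the Viterbi score keeps only the best path. The sketch below illustrates that difference on a toy automaton; the nested-list representation and the function names are illustrative and may differ from the code in main.py.

```python
def forward_score(T, start, final, ids):
    """Sum of the weights of all paths that read the word ids."""
    h = list(start)
    for i in ids:
        h = [sum(h[j] * T[i][j][k] for j in range(len(h))) for k in range(len(h))]
    return sum(a * b for a, b in zip(h, final))

def viterbi_score(T, start, final, ids):
    """Weight of the single best path that reads the word ids."""
    h = list(start)
    for i in ids:
        h = [max(h[j] * T[i][j][k] for j in range(len(h))) for k in range(len(h))]
    return max(a * b for a, b in zip(h, final))

# Toy automaton for "$ * b $ *" over the vocabulary {a: 0, b: 1}.
# State 0: "b" not consumed yet; state 1: "b" consumed (final state).
T = [
    [[1, 0], [0, 1]],  # word "a": stay in the current state
    [[1, 1], [0, 1]],  # word "b": stay, or move from state 0 to state 1
]
start, final = [1, 0], [0, 1]

print(forward_score(T, start, final, [1, 1]))  # → 2 ("b b" has two accepting paths)
print(viterbi_score(T, start, final, [1, 1]))  # → 1 (the best single path)
```

With 0/1 weights the Viterbi score behaves like plain acceptance, whereas the forward score counts accepting paths.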

Decomposing FAs

We decompose the FAs using tensor rank decomposition.

Run the following command to decompose an FA.

python decompose_automata.py --dataset ATIS --automata_name automata_name --rank 150 --init svd
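
Tensor rank (CP) decomposition approximates a transition tensor of shape V × K × K (vocabulary size by states by states) as a sum of R rank-one terms, where R corresponds to the value passed as --rank. The sketch below does not run the decomposition itself. It uses randomly generated factors with hypothetical names (E, D1, D2) to show how such factors stand in for the full tensor during one automaton step.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, R = 5, 4, 3  # toy sizes: vocabulary, states, rank

# Hypothetical CP factors, one matrix per tensor mode.
E = rng.random((V, R))   # word mode
D1 = rng.random((K, R))  # source-state mode
D2 = rng.random((K, R))  # target-state mode

# Full tensor implied by the factors:
# T[v, j, k] = sum_r E[v, r] * D1[j, r] * D2[k, r]
T = np.einsum("vr,jr,kr->vjk", E, D1, D2)

def step_full(h, word_id):
    """One automaton step using the full transition matrix of the word."""
    return h @ T[word_id]

def step_factored(h, word_id):
    """The same step computed through the rank-R factors only."""
    return ((h @ D1) * E[word_id]) @ D2.T

h0 = rng.random(K)
print(np.allclose(step_full(h0, 2), step_factored(h0, 2)))
```

The two steps agree up to floating-point error, and the factored form never needs the full V × K × K tensor in memory.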

FAs to FA-RNN and training FA-RNN.

To train FA-RNNs on ATIS, SMS, and TREC, make sure you have finished the steps above. You can then train an FA-RNN initialized from the decomposed automata. If you have downloaded the automata and placed them in the right location, you can run:

python main.py --dataset ATIS --run save_dir --model_type FSARNN --beta 0.9 \
 --wfa_type forward --seed 0 --lr 0.001 --bidirection 0 --farnn 0 --random 0

Please check the function get_automata_from_seed in utils/utils.py to understand which automaton you are using.

If you use newly decomposed automata, you need to specify the --automata_path_forward and --automata_path_backward options.

For example: FARNN

python main.py --dataset ATIS --run save_dir --model_type FSARNN --bidirection 0 \
--beta 0.9 --wfa_type forward --seed 0 --lr 0.001 --farnn 0 --normalize_automata l2 \
--automata_path_forward automata.newrule.split.randomseed150.False.0.0003.0.pkl

For example: BiFARNN

python main.py --dataset ATIS --run save_dir --model_type FSARNN --bidirection 1 \
--beta 0.9 --wfa_type forward --seed 0 --lr 0.001 --farnn 0 --normalize_automata l2 \
--automata_path_forward automata.newrule.split.randomseed150.False.0.0003.0.pkl \
--automata_path_backward automata.newrule.reversed.randomseed150.False.0.0735.0.pkl

For example: FAGRU

python main.py --dataset ATIS --run save_dir --model_type FSARNN --bidirection 0 \
--beta 0.9 --wfa_type forward --seed 0 --lr 0.001 --farnn 1 --normalize_automata l2 \
--automata_path_forward automata.newrule.split.randomseed150.False.0.0003.0.pkl

We also provide a cleaner version of the code in /src_simple, with some options and unimportant code removed; it contains only FARNN-related code. As an example:

python main_min.py --dataset ATIS --run save_dir --model_type FSARNN --beta 0.3

Interpretability and Models.

You first need to download the FA-RNN models and config files here: Google Drive, Tencent Drive. Please place the files in the /model directory.

To see the log and hyper-parameters of the provided models, simply use pickle to load the .res config files. For example, to retrieve the hyper-parameters for the model D0.9739-T0.9653-DI0.8655-TI0.8645-1106095843-1604656723.555744-ATIS-0.model, you can run:

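import pickle
# Load and display the saved log and hyper-parameters from the .res config file.
with open('1106095843-1604656723.5809364.res', 'rb') as f:
    print(pickle.load(f))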

Note that some unused hyper-parameters in the config files were removed in the final and simple versions of the code, so a config file may not be directly usable; just filter out the unused hyper-parameters.
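
Filtering can be as simple as keeping only the keys that the current code still accepts. The sketch below assumes that the loaded object behaves like a dict (or like an argparse.Namespace, handled through vars) and uses made-up key names; neither assumption is confirmed by the repository, so inspect the loaded object first.

```python
def filter_config(config, allowed_keys):
    """Keep only the entries whose key is in allowed_keys."""
    items = config if isinstance(config, dict) else vars(config)
    return {k: v for k, v in items.items() if k in allowed_keys}

# Hypothetical example: "old_flag" stands for an option removed from the final code.
loaded = {"dataset": "ATIS", "lr": 0.001, "old_flag": 1}
print(filter_config(loaded, {"dataset", "lr", "seed"}))
```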

We provide several examples showing how to convert the trained model parameters back into WFAs and threshold them into NFAs. See the file jupyter/checkTrainedRules.ipynb.
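
Thresholding a WFA into an NFA can be pictured as keeping only the transitions whose weight exceeds a cutoff. The sketch below uses a toy tensor and an arbitrary threshold chosen for illustration; the notebook remains the reference for how the trained models are actually converted.

```python
import numpy as np

def threshold_wfa(T, tau):
    """Turn a weighted transition tensor into a 0/1 tensor: keep weights above tau."""
    return (np.asarray(T) > tau).astype(int)

# Toy weighted transitions for one word over two states (values are illustrative).
T = np.array([[[0.92, 0.07],
               [0.01, 0.85]]])
print(threshold_wfa(T, tau=0.5))
```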
