
bzhangGo / lrn

License: BSD-3-Clause
Source code for "A Lightweight Recurrent Network for Sequence Modeling"

Programming Languages

  • python
  • shell

Projects that are alternatives of or similar to lrn

DSKG
No description or website provided.
Stars: ✭ 65 (+195.45%)
Mutual labels:  recurrent-neural-network
off-policy-continuous-control
[DeepRL Workshop, NeurIPS-21] Recurrent Off-policy Baselines for Memory-based Continuous Control (RDPG, RTD3 and RSAC)
Stars: ✭ 29 (+31.82%)
Mutual labels:  recurrent-neural-network
char-rnnlm-tensorflow
Char RNN Language Model based on Tensorflow
Stars: ✭ 14 (-36.36%)
Mutual labels:  recurrent-neural-network
deep-pmsm
Estimate intrinsic Permanent Magnet Synchronous Motor temperatures with deep recurrent and convolutional neural networks.
Stars: ✭ 29 (+31.82%)
Mutual labels:  recurrent-neural-network
char-rnn-text-generation
Character Embeddings Recurrent Neural Network Text Generation Models
Stars: ✭ 64 (+190.91%)
Mutual labels:  recurrent-neural-network
PolyphonicPianoTranscription
Recurrent Neural Network for generating piano MIDI-files from audio (MP3, WAV, etc.)
Stars: ✭ 146 (+563.64%)
Mutual labels:  recurrent-neural-network
deep-explanation-penalization
Code for using CDEP from the paper "Interpretations are useful: penalizing explanations to align neural networks with prior knowledge" https://arxiv.org/abs/1909.13584
Stars: ✭ 110 (+400%)
Mutual labels:  recurrent-neural-network
Natural-Language-Processing
Contains various architectures and novel paper implementations for Natural Language Processing tasks like Sequence Modelling and Neural Machine Translation.
Stars: ✭ 48 (+118.18%)
Mutual labels:  sequence-modeling

lrn

Source code for "A Lightweight Recurrent Network for Sequence Modeling"

Model Architecture

In our new paper, we propose the lightweight recurrent network (LRN), which combines the strengths of ATR and SRU.

  • ATR reduces model parameters and, through its twin-gate mechanism, avoids additional free parameters for the gate computation (see the sketch after this list).
  • SRU follows QRNN and moves all matrix computations outside the recurrence.
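
For reference, a sketch of ATR's twin-gate mechanism as described by Zhang et al. (2018), in which both gates are built from the same two terms and differ only in sign:

```latex
\begin{aligned}
p_t &= W x_t, & q_t &= U h_{t-1} \\
i_t &= \sigma(p_t + q_t), & f_t &= \sigma(p_t - q_t) \\
h_t &= i_t \odot p_t + f_t \odot h_{t-1}
\end{aligned}
```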

Based on the above units, we propose LRN:
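
A sketch of the recurrence in LaTeX, following the formulation in the paper (q_t, k_t and v_t denote the query, key and value projections of the input x_t; treat the exact sign placement as a best-effort reading of the paper):

```latex
\begin{aligned}
q_t,\ k_t,\ v_t &= W_q x_t,\ W_k x_t,\ W_v x_t \\
i_t &= \sigma(k_t + h_{t-1}), \qquad f_t = \sigma(q_t - h_{t-1}) \\
h_t &= g\left(i_t \odot v_t + f_t \odot h_{t-1}\right)
\end{aligned}
```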

where g(·) is an activation function, tanh or identity, and Wq, Wk and Wv are model parameters. The matrix computations (as well as any layer normalization) can be shifted outside the recurrence, so the whole model runs fast.

When the twin-gate mechanism is applied, the values in ht can suffer from explosion and grow towards infinity. This is why we add the activation function. An alternative solution is layer normalization, which keeps the activation values stable.
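
As a minimal sketch of how this recurrence can be computed with the matrix work hoisted out of the loop (NumPy, illustrative only; this is not the repository's TensorFlow implementation, and names such as lrn_forward are made up here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lrn_forward(x, Wq, Wk, Wv, g=np.tanh):
    """Minimal LRN sketch.

    x: [T, d_in] input sequence; Wq, Wk, Wv: [d_in, d_hid] parameters.
    g: activation (tanh or identity).
    """
    # heavy matrix computation, shifted outside the recurrence
    q, k, v = x @ Wq, x @ Wk, x @ Wv          # each [T, d_hid]

    h = np.zeros(Wq.shape[1])
    states = []
    for t in range(x.shape[0]):
        i_t = sigmoid(k[t] + h)               # input gate (twin gate, "+")
        f_t = sigmoid(q[t] - h)               # forget gate (twin gate, "-")
        h = g(i_t * v[t] + f_t * h)           # element-wise recurrence only
        states.append(h)
    return np.stack(states)
```

Note that the only operations left inside the loop are element-wise, which is what makes the recurrence cheap.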

Structure Analysis

One way to understand the model is to unfold the LRN structure along input tokens:
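
Assuming an identity activation g and h_0 = 0, unrolling the recurrence above gives (a sketch):

```latex
h_t = \sum_{k=1}^{t} \Big( \prod_{l=k+1}^{t} f_l \Big) \odot i_k \odot v_k
```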

The above structure, which has also been observed by Zhang et al., Lee et al., and others, endows the RNN model with multiple interpretations. We provide two of them below:

  • Relation with Self Attention Networks

Informally, LRN assembles the forget gates from step t down to step k+1 in order to query the key (the input gate). The resulting weight is assigned to the corresponding value representation and contributes to the final hidden representation.

Do the learned weights make sense? We ran a classification experiment on the AmaPolar task with a unidirectional linear-LRN, feeding the final hidden state into the classifier. The example below shows the learned weights: the term great gains a large weight, which decays slowly and contributes to the final positive decision.
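
As a rough sketch of how such per-token weights could be read off a linear (identity-activation) LRN (again NumPy and illustrative; lrn_attention_weights is a hypothetical helper, not part of this repository):

```python
import numpy as np

def lrn_attention_weights(x, Wq, Wk, Wv):
    """Per-token weights on the final state of a linear (identity) LRN.

    Returns a [T] array: the mean over hidden units of the weight that
    token k's value v_k receives in h_T, i.e. i_k * prod_{l>k} f_l.
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    T, d = v.shape

    h = np.zeros(d)
    i_gates, f_gates = [], []
    for t in range(T):
        i_t = sigmoid(k[t] + h)
        f_t = sigmoid(q[t] - h)
        i_gates.append(i_t)
        f_gates.append(f_t)
        h = i_t * v[t] + f_t * h              # identity activation

    # accumulate forget-gate products backwards from the last step
    weights = np.ones(d)
    out = np.zeros(T)
    for t in range(T - 1, -1, -1):
        out[t] = (weights * i_gates[t]).mean()
        weights = weights * f_gates[t]
    return out
```

Plotting the returned weights against the input tokens would then show which tokens (e.g. great) dominate the final decision.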

  • Long-term and Short-term Memory

Another view of the unfolded structure is that the different gates form different memory mechanisms. The input gate acts as a short-term memory and indicates how much information from the current token is activated. The forget gates form a forget chain that controls how meaningless past information is erased.

Experiments

We ran experiments on six different tasks.

Citation

Please cite the following paper:

Biao Zhang; Rico Sennrich (2019). A Lightweight Recurrent Network for Sequence Modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy.

@inproceedings{zhang-sennrich:2019:ACL,
  address = "Florence, Italy",
  author = "Zhang, Biao and Sennrich, Rico",
  booktitle = "{Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics}",
  publisher = "Association for Computational Linguistics",
  title = "{A Lightweight Recurrent Network for Sequence Modeling}",
  year = "2019"
}

Contact

For any further comments or questions about LRN, please email Biao Zhang.
