
PetrochukM / PyTorch-NLP

License: BSD-3-Clause
Basic Utilities for PyTorch Natural Language Processing (NLP)

Programming Languages

python
shell

Projects that are alternatives to or similar to PyTorch-NLP

Char Rnn Tensorflow
Multi-layer recurrent neural networks for character-level language models, implemented in TensorFlow
Stars: ✭ 58 (-97.09%)
Mutual labels:  dataset, natural-language-processing
Magnitude
A fast, efficient universal vector embedding utility package.
Stars: ✭ 1,394 (-30.16%)
Mutual labels:  natural-language-processing, embeddings
Cesi
WWW 2018: CESI: Canonicalizing Open Knowledge Bases using Embeddings and Side Information
Stars: ✭ 85 (-95.74%)
Mutual labels:  dataset, embeddings
Wikisql
A large annotated semantic parsing corpus for developing natural language interfaces.
Stars: ✭ 965 (-51.65%)
Mutual labels:  dataset, natural-language-processing
Chars2vec
Character-based word embedding model based on RNNs for handling real-world texts
Stars: ✭ 130 (-93.49%)
Mutual labels:  natural-language-processing, embeddings
Mtnt
Code for the collection and analysis of the MTNT dataset
Stars: ✭ 48 (-97.6%)
Mutual labels:  dataset, natural-language-processing
Bond
BOND: BERT-Assisted Open-Domain Named Entity Recognition with Distant Supervision
Stars: ✭ 96 (-95.19%)
Mutual labels:  dataset, natural-language-processing
Speedtorch
Library for faster pinned CPU <-> GPU transfer in Pytorch
Stars: ✭ 615 (-69.19%)
Mutual labels:  natural-language-processing, embeddings
Awesome Hungarian Nlp
A curated list of NLP resources for Hungarian
Stars: ✭ 121 (-93.94%)
Mutual labels:  dataset, natural-language-processing
Awesome Embedding Models
A curated list of awesome embedding model tutorials, projects, and communities.
Stars: ✭ 1,486 (-25.55%)
Mutual labels:  natural-language-processing, embeddings
Bpemb
Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
Stars: ✭ 909 (-54.46%)
Mutual labels:  natural-language-processing, embeddings
Prosody
The Helsinki Prosody Corpus and a system for predicting prosodic prominence from text
Stars: ✭ 139 (-93.04%)
Mutual labels:  dataset, natural-language-processing
Insuranceqa Corpus Zh
🚁 An insurance-industry corpus and chatbot (in Chinese)
Stars: ✭ 821 (-58.87%)
Mutual labels:  dataset, natural-language-processing
Coarij
Corpus of Annual Reports in Japan
Stars: ✭ 55 (-97.24%)
Mutual labels:  dataset, natural-language-processing
Wikipedia2vec
A tool for learning vector representations of words and entities from Wikipedia
Stars: ✭ 655 (-67.18%)
Mutual labels:  natural-language-processing, embeddings
Pytreebank
😡😇 Stanford Sentiment Treebank loader in Python
Stars: ✭ 93 (-95.34%)
Mutual labels:  dataset, natural-language-processing
Ner Lstm
Named Entity Recognition using multilayered bidirectional LSTM
Stars: ✭ 532 (-73.35%)
Mutual labels:  natural-language-processing, embeddings
Hate Speech And Offensive Language
Repository for the paper "Automated Hate Speech Detection and the Problem of Offensive Language", ICWSM 2017
Stars: ✭ 543 (-72.8%)
Mutual labels:  dataset, natural-language-processing
Ua Gec
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
Stars: ✭ 108 (-94.59%)
Mutual labels:  dataset, natural-language-processing
Mams For Absa
A Multi-Aspect Multi-Sentiment Dataset for aspect-based sentiment analysis.
Stars: ✭ 135 (-93.24%)
Mutual labels:  dataset, natural-language-processing

Basic Utilities for PyTorch Natural Language Processing (NLP)

PyTorch-NLP, or torchnlp for short, is a library of basic utilities for PyTorch NLP. torchnlp extends PyTorch to provide you with basic text data processing functions.

Logo by Chloe Yeo, Corporate Sponsorship by WellSaid Labs

Installation 🐾

Make sure you have Python 3.6+ and PyTorch 1.0+. You can then install pytorch-nlp using pip:

pip install pytorch-nlp

Or to install the latest code via:

pip install git+https://github.com/PetrochukM/PyTorch-NLP.git

Docs

The complete documentation for PyTorch-NLP is available via our ReadTheDocs website.

Get Started

Within an NLP data pipeline, you'll want to implement these basic steps:

1. Load your Data 🐿

Load the IMDB dataset, for example:

from torchnlp.datasets import imdb_dataset

# Load the imdb training dataset
train = imdb_dataset(train=True)
train[0]  # RETURNS: {'text': 'For a movie that gets..', 'sentiment': 'pos'}

Load a custom dataset, for example:

from pathlib import Path

from torchnlp.download import download_file_maybe_extract

directory_path = Path('data/')
train_file_path = Path('trees/train.txt')

download_file_maybe_extract(
    url='http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip',
    directory=directory_path,
    check_files=[train_file_path])

open(directory_path / train_file_path)

Don't worry, we'll handle caching for you!

2. Text to Tensor

Tokenize and encode your text as a tensor.

For example, a WhitespaceEncoder breaks text into tokens whenever it encounters a whitespace character.

from torchnlp.encoders.text import WhitespaceEncoder

loaded_data = ["now this ain't funny", "so don't you dare laugh"]
encoder = WhitespaceEncoder(loaded_data)
encoded_data = [encoder.encode(example) for example in loaded_data]
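
To make the mapping concrete, here is a small sketch of what the encoder produces. The exact vocabulary size and token indices depend on the reserved tokens (padding, unknown, etc.) that torchnlp prepends, so treat the printed values as illustrative:

from torchnlp.encoders.text import WhitespaceEncoder

encoder = WhitespaceEncoder(["now this ain't funny", "so don't you dare laugh"])

# The vocabulary holds the whitespace-delimited tokens plus reserved tokens.
print(encoder.vocab_size)      # number of unique tokens + reserved tokens
print(encoder.vocab[:5])       # reserved tokens come first

# `encode` maps text to a 1-D tensor of token indices; `decode` reverses it.
tensor = encoder.encode("now this ain't funny")
print(tensor)                  # e.g. tensor([5, 6, 7, 8])
print(encoder.decode(tensor))  # "now this ain't funny"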

3. Tensor to Batch

With your loaded and encoded data in hand, you'll want to batch your dataset.

import torch
from torchnlp.samplers import BucketBatchSampler
from torchnlp.utils import collate_tensors
from torchnlp.encoders.text import stack_and_pad_tensors

encoded_data = [torch.randn(2), torch.randn(3), torch.randn(4), torch.randn(5)]

train_sampler = torch.utils.data.sampler.SequentialSampler(encoded_data)
train_batch_sampler = BucketBatchSampler(
    train_sampler, batch_size=2, drop_last=False, sort_key=lambda i: encoded_data[i].shape[0])

batches = [[encoded_data[i] for i in batch] for batch in train_batch_sampler]
batches = [collate_tensors(batch, stack_tensors=stack_and_pad_tensors) for batch in batches]

PyTorch-NLP builds on top of PyTorch's existing torch.utils.data.sampler, torch.stack and default_collate to support sequential inputs of varying lengths!
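
The same pieces also slot into torch.utils.data.DataLoader, since BucketBatchSampler is a standard batch sampler and collate_tensors works as a collate_fn. A minimal sketch (the setup lines repeat the snippet above so it runs on its own; stack_and_pad_tensors returns the padded tensor together with the original lengths):

import torch
from torchnlp.encoders.text import stack_and_pad_tensors
from torchnlp.samplers import BucketBatchSampler
from torchnlp.utils import collate_tensors

encoded_data = [torch.randn(2), torch.randn(3), torch.randn(4), torch.randn(5)]
train_sampler = torch.utils.data.sampler.SequentialSampler(encoded_data)
train_batch_sampler = BucketBatchSampler(
    train_sampler, batch_size=2, drop_last=False, sort_key=lambda i: encoded_data[i].shape[0])

# `batch_sampler` yields batches of indices; `collate_fn` pads each batch of
# variable-length tensors to a common length.
loader = torch.utils.data.DataLoader(
    encoded_data,
    batch_sampler=train_batch_sampler,
    collate_fn=lambda batch: collate_tensors(batch, stack_tensors=stack_and_pad_tensors))

for batch in loader:
    padded, lengths = batch  # a (padded tensor, lengths) pair
    print(padded.shape, lengths)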

4. Training and Inference

With your batch in hand, you can use PyTorch to develop and train your model using gradient descent. For example, check out the Stanford Natural Language Inference (SNLI) training example in the examples/ directory.
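
If you just want to see where the padded batches fit, here is a bare-bones, hypothetical training loop. ToyClassifier, the random batches, and the labels below are placeholders for your own model and data, not part of torchnlp:

import torch

# Hypothetical setup: a toy classifier trained on padded batches of token ids.
vocab_size, embedding_dim, num_classes = 100, 32, 2

class ToyClassifier(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.linear = torch.nn.Linear(embedding_dim, num_classes)

    def forward(self, tokens):  # tokens: (batch, sequence) LongTensor
        return self.linear(self.embedding(tokens).mean(dim=1))  # mean-pool over time

model = ToyClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-ins for the encoded, padded batches built in step 3 and their labels.
batches = [torch.randint(0, vocab_size, (2, 5)) for _ in range(10)]
labels = [torch.randint(0, num_classes, (2,)) for _ in range(10)]

for tokens, target in zip(batches, labels):
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(tokens), target)
    loss.backward()
    optimizer.step()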

Last But Not Least

PyTorch-NLP has a couple more NLP-focused utility packages to support you! 🤗

Deterministic Functions

Now that you've set up your pipeline, you may want to ensure that some functions run deterministically. Wrap any code that's random with fork_rng and you'll be good to go, like so:

import random
import numpy
import torch

from torchnlp.random import fork_rng

with fork_rng(seed=123):  # Ensure determinism
    print('Random:', random.randint(1, 2**31))
    print('Numpy:', numpy.random.randint(1, 2**31))
    print('Torch:', int(torch.randint(1, 2**31, (1,))))

This will always print:

Random: 224899943
Numpy: 843828735
Torch: 843828736

Pre-Trained Word Vectors

Now that you've computed your vocabulary, you may want to make use of pre-trained word vectors to set your embeddings, like so:

import torch
from torchnlp.encoders.text import WhitespaceEncoder
from torchnlp.word_to_vector import GloVe

encoder = WhitespaceEncoder(["now this ain't funny", "so don't you dare laugh"])

vocab_set = set(encoder.vocab)
pretrained_embedding = GloVe(name='6B', dim=100, is_include=lambda w: w in vocab_set)
embedding_weights = torch.Tensor(encoder.vocab_size, pretrained_embedding.dim)
for i, token in enumerate(encoder.vocab):
    embedding_weights[i] = pretrained_embedding[token]
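
From there, one common way to use the weight matrix is to load it into a standard embedding layer with torch.nn.Embedding.from_pretrained. A small sketch, using a random matrix as a stand-in for the embedding_weights built above:

import torch

# Stand-in for the `embedding_weights` matrix built above (vocab_size x dim).
pretrained_weights = torch.randn(20, 100)

# `freeze=False` keeps the vectors trainable; `freeze=True` (the default)
# leaves them fixed.
embedding = torch.nn.Embedding.from_pretrained(pretrained_weights, freeze=False)

token_ids = torch.tensor([[1, 4, 7]])  # a batch with one sequence of token indices
vectors = embedding(token_ids)         # shape: (1, 3, 100)
print(vectors.shape)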

Neural Networks Layers

For example, from the neural network package, apply the state-of-the-art LockedDropout:

import torch
from torchnlp.nn import LockedDropout

input_ = torch.randn(6, 3, 10)
dropout = LockedDropout(0.5)

# Apply a LockedDropout to `input_`
dropout(input_) # RETURNS: torch.FloatTensor (6x3x10)
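
The "locked" part is the key difference from regular dropout: a single dropout mask is sampled per forward pass and reused at every time step. A quick sketch that checks this property, assuming the (sequence length, batch size, features) layout used above:

import torch
from torchnlp.nn import LockedDropout

torch.manual_seed(0)
input_ = torch.randn(6, 3, 10)  # (sequence length, batch size, features)
dropout = LockedDropout(0.5)    # modules are in training mode by default

output = dropout(input_)

# With `torch.randn` inputs, zeros in the output mark dropped units; the zero
# pattern repeats at every time step, i.e. the mask is shared along dim 0.
dropped = output == 0
print(bool((dropped.all(dim=0) == dropped.any(dim=0)).all()))  # True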

Metrics

Compute common NLP metrics such as the BLEU score.

from torchnlp.metrics import get_moses_multi_bleu

hypotheses = ["The brown fox jumps over the dog 笑"]
references = ["The quick brown fox jumps over the lazy dog 笑"]

# Compute BLEU score with the official BLEU perl script
get_moses_multi_bleu(hypotheses, references, lowercase=True)  # RETURNS: 47.9

Help

Looking at the longer examples in examples/ may help you.

Need more help? We are happy to answer your questions via Gitter Chat.

Contributing

We've released PyTorch-NLP because we found a lack of basic toolkits for NLP in PyTorch. We hope that other organizations can benefit from the project. We are thankful for any contributions from the community.

Contributing Guide

Read our contributing guide to learn about our development process, how to propose bugfixes and improvements, and how to build and test your changes to PyTorch-NLP.

Related Work

torchtext

torchtext and PyTorch-NLP differ in architecture and feature set; otherwise, they are similar. Both provide pre-trained word vectors, datasets, iterators, and text encoders; PyTorch-NLP also provides neural network modules and metrics. From an architecture standpoint, torchtext is object-oriented with external coupling, while PyTorch-NLP is object-oriented with low coupling.

AllenNLP

AllenNLP is designed to be a platform for research. PyTorch-NLP is designed to be a lightweight toolkit.

Authors

Citing

If you find PyTorch-NLP useful for an academic publication, then please use the following BibTeX to cite it:

@misc{pytorch-nlp,
  author = {Petrochuk, Michael},
  title = {PyTorch-NLP: Rapid Prototyping with PyTorch Natural Language Processing (NLP) Tools},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/PetrochukM/PyTorch-NLP}},
}