pytorch / Text

Licence: bsd-3-clause
Data loaders and abstractions for text and NLP


torchtext

This repository consists of:

Note: The legacy code discussed in the torchtext v0.7.0 release notes has been moved to the torchtext.legacy folder. This legacy code will not be maintained by the development team, and we plan to remove it entirely in a future release. See the torchtext.legacy folder for more details.

Installation

We recommend Anaconda as a Python package management system. Please refer to pytorch.org for the details of PyTorch installation. The table below lists the torchtext version that corresponds to each PyTorch version, together with the supported Python versions.

Version Compatibility

PyTorch version   torchtext version   Supported Python version
nightly build     main                >=3.6, <=3.9
1.10.0            0.11.0              >=3.6, <=3.9
1.9.1             0.10.1              >=3.6, <=3.9
1.9               0.10                >=3.6, <=3.9
1.8.2             0.9.2               >=3.6, <=3.9
1.8.1             0.9.1               >=3.6, <=3.9
1.8               0.9                 >=3.6, <=3.9
1.7.1             0.8.1               >=3.6, <=3.9
1.7               0.8                 >=3.6, <=3.8
1.6               0.7                 >=3.6, <=3.8
1.5               0.6                 >=3.5, <=3.8
1.4               0.5                 2.7, >=3.5, <=3.8
0.4 and below     0.2.3               2.7, >=3.5, <=3.8
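
The pairing in the compatibility table can be sketched as a small lookup, which may help when checking an existing environment. This is only an illustration: the names TORCHTEXT_FOR_PYTORCH and expected_torchtext are hypothetical and not part of torchtext, and treating a version such as "1.9.0" as the "1.9" row is an assumption.

```python
from typing import Optional

# Pairs copied from the compatibility table (PyTorch version -> torchtext version).
# The "nightly build" and "0.4 and below" rows are left out because they are
# not single version strings.
TORCHTEXT_FOR_PYTORCH = {
    "1.10.0": "0.11.0",
    "1.9.1": "0.10.1",
    "1.9": "0.10",
    "1.8.2": "0.9.2",
    "1.8.1": "0.9.1",
    "1.8": "0.9",
    "1.7.1": "0.8.1",
    "1.7": "0.8",
    "1.6": "0.7",
    "1.5": "0.6",
    "1.4": "0.5",
}


def expected_torchtext(pytorch_version: str) -> Optional[str]:
    """Return the torchtext version paired with a PyTorch version, or None."""
    # Drop local build suffixes such as "+cu111" before the lookup.
    base = pytorch_version.split("+")[0]
    if base in TORCHTEXT_FOR_PYTORCH:
        return TORCHTEXT_FOR_PYTORCH[base]
    # Assumption: "1.9.0" refers to the same release as the "1.9" row.
    major_minor, _, patch = base.rpartition(".")
    if patch == "0":
        return TORCHTEXT_FOR_PYTORCH.get(major_minor)
    return None
```

With PyTorch installed, calling expected_torchtext(torch.__version__) gives the torchtext release to install; a result of None means the version is not covered by the table.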

Using conda:

conda install -c pytorch torchtext

Using pip:

pip install torchtext

Optional requirements

If you want to use the English tokenizer from spaCy, you need to install spaCy and download its English model:

pip install spacy
python -m spacy download en_core_web_sm

Alternatively, you might want to use the Moses tokenizer port in SacreMoses (split from NLTK). You have to install SacreMoses:

pip install sacremoses

For torchtext 0.5 and below, install sentencepiece:

conda install -c powerai sentencepiece

Building from source

To build torchtext from source, you need git, CMake, and a C++11 compiler such as g++:

git clone https://github.com/pytorch/text torchtext
cd torchtext
git submodule update --init --recursive

# Linux
python setup.py clean install

# OSX
MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py clean install

# or ``python setup.py develop`` if you are making modifications.

Note

When building from source, make sure that you use the same C++ compiler as the one used to build PyTorch. A simple way is to build PyTorch from source and use the same environment to build torchtext. If you are using the nightly build of PyTorch, check out the environment it was built with for conda (here) and pip (here).

Documentation

Find the documentation here.

Datasets

The datasets module currently contains:

  • Language modeling: WikiText2, WikiText103, PennTreebank, EnWik9
  • Machine translation: IWSLT2016, IWSLT2017, Multi30k
  • Sequence tagging (e.g. POS/NER): UDPOS, CoNLL2000Chunking
  • Question answering: SQuAD1, SQuAD2
  • Text classification: AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull, IMDB

For example, to access the raw text from the AG_NEWS dataset:

>>> from torchtext.datasets import AG_NEWS
>>> train_iter = AG_NEWS(split='train')
>>> next(train_iter)
>>> # Or iterate with for loop
>>> for (label, line) in train_iter:
...     print(label, line)
>>> # Or send to DataLoader
>>> from torch.utils.data import DataLoader
>>> train_iter = AG_NEWS(split='train')
>>> dataloader = DataLoader(train_iter, batch_size=8, shuffle=False)
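
Because the dataset yields raw (label, line) pairs, a DataLoader usually needs a collate function to turn each batch into something a model can consume. The sketch below shows one possible shape for such a function in plain Python. SAMPLE_PAIRS, tokenize and collate_batch are hypothetical names, the sample lines are made up rather than taken from AG_NEWS, and the whitespace tokenizer is a stand-in for a real one.

```python
from typing import List, Sequence, Tuple

# Hypothetical stand-in for a batch drawn from AG_NEWS(split='train'):
# made-up (label, line) pairs in the same shape as the real iterator yields.
SAMPLE_PAIRS = [
    (3, "stocks rally after rate cut"),
    (4, "new chip promises faster laptops"),
    (2, "home team wins the cup final"),
]

PAD_TOKEN = "<pad>"


def tokenize(line: str) -> List[str]:
    """Lowercase whitespace tokenizer (an assumption, not a torchtext tokenizer)."""
    return line.lower().split()


def collate_batch(
    batch: Sequence[Tuple[int, str]],
) -> Tuple[List[int], List[List[str]]]:
    """Turn a batch of (label, line) pairs into labels and padded token lists."""
    # AG_NEWS labels run from 1 to 4; shift them to 0..3 for use as class indices.
    labels = [label - 1 for label, _ in batch]
    tokens = [tokenize(line) for _, line in batch]
    # Pad every token list to the length of the longest one in the batch.
    max_len = max(len(t) for t in tokens)
    padded = [t + [PAD_TOKEN] * (max_len - len(t)) for t in tokens]
    return labels, padded
```

A function like this can be passed to the loader as DataLoader(train_iter, batch_size=8, collate_fn=collate_batch). In practice the padded tokens would then be mapped to integer indices and converted to tensors.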

Tutorials

To get started with torchtext, users may refer to the tutorials available on the PyTorch website.

[BC Breaking] Legacy

In the v0.9.0 release, we moved the following legacy code to torchtext.legacy as part of the work to revamp the torchtext library (the motivation is discussed in Issue #664):

  • torchtext.legacy.data.field
  • torchtext.legacy.data.batch
  • torchtext.legacy.data.example
  • torchtext.legacy.data.iterator
  • torchtext.legacy.data.pipeline
  • torchtext.legacy.datasets

We have a migration tutorial to help users switch to the torchtext datasets in the v0.9.0 release. Users who still want the legacy components can add legacy to the import path (for example, torchtext.data.field becomes torchtext.legacy.data.field).

In the v0.10.0 release, we retired the Vocab class to torchtext.legacy. Users can still access the legacy Vocab via torchtext.legacy.vocab. This class has been replaced by a Vocab module that is backed by an efficient C++ implementation and provides common functional APIs for NLP workflows.
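
As a rough illustration of the replacement Vocab module, the sketch below builds a vocabulary with build_vocab_from_iterator from torchtext.vocab. It assumes torchtext 0.10 or later; yield_tokens and build_demo_vocab are hypothetical helper names, and the whitespace tokenizer is a stand-in. The torchtext import sits inside the function so that the helpers can be defined even where torchtext is not installed.

```python
from typing import Iterable, Iterator, List


def yield_tokens(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one token list per line (lowercase whitespace split, a stand-in)."""
    for line in lines:
        yield line.lower().split()


def build_demo_vocab(lines: Iterable[str]):
    """Build a Vocab from an iterable of lines (assumes torchtext >= 0.10)."""
    # Imported lazily so this module loads without torchtext installed.
    from torchtext.vocab import build_vocab_from_iterator

    vocab = build_vocab_from_iterator(yield_tokens(lines), specials=["<unk>"])
    # Map out-of-vocabulary tokens to "<unk>" instead of raising an error.
    vocab.set_default_index(vocab["<unk>"])
    return vocab
```

For example, vocab = build_demo_vocab(["the quick brown fox", "jumps over the lazy dog"]) followed by vocab(["the", "fox", "unseen"]) returns a list of integer indices, with the unseen token mapped to the index of "<unk>".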

Disclaimer on Datasets

This is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.

If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!
