Licence: MIT license
TextAugment: Text Augmentation Library

TextAugment: Improving Short Text Classification through Global Augmentation Methods

You have just found TextAugment.

TextAugment is a Python 3 library for augmenting text for natural language processing applications. TextAugment stands on the giant shoulders of NLTK, Gensim, and TextBlob and plays nicely with them.

Features

  • Generates synthetic data to improve model performance without manual effort
  • Simple, lightweight, and easy to use
  • Plug and play with any machine learning framework (e.g. PyTorch, TensorFlow, Scikit-learn)
  • Supports textual data

Citation Paper

Improving short text classification through global augmentation methods.

Requirements

  • Python 3

The following packages are dependencies and are installed automatically with TextAugment. To install them manually:

$ pip install numpy nltk gensim textblob googletrans

The following code downloads the NLTK WordNet corpus.

import nltk
nltk.download('wordnet')

The following code downloads the NLTK sentence tokenizer. This tokenizer divides a text into a list of sentences, using an unsupervised algorithm to build a model of abbreviations, collocations, and words that start sentences.

nltk.download('punkt')

The following code downloads the default NLTK part-of-speech tagger model. A part-of-speech tagger processes a sequence of words and attaches a part-of-speech tag to each word.

nltk.download('averaged_perceptron_tagger')

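Use gensim to load a pre-trained word2vec model, such as the Google News vectors (available from Google Drive).

import gensim
# load_word2vec_format lives on KeyedVectors in current gensim releases
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)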

You can also use gensim to load Facebook's fastText English and multilingual models:

import gensim
model = gensim.models.fasttext.load_facebook_model('./cc.en.300.bin.gz')

Alternatively, train one from scratch on your own data or on a public dataset.

Installation

Install from pip [Recommended]

$ pip install textaugment
or install the latest release from GitHub
$ pip install git+https://github.com/dsfsi/textaugment.git

Install from source

$ git clone https://github.com/dsfsi/textaugment.git
$ cd textaugment
$ python setup.py install

How to use

There are three types of augmentations which can be used:

  • word2vec
from textaugment import Word2vec
  • wordnet
from textaugment import Wordnet
  • translate (This will require internet access)
from textaugment import Translate

Word2vec-based augmentation

See this notebook for an example

Basic example

>>> from textaugment import Word2vec
>>> t = Word2vec(model='path/to/gensim/model')  # or pass a loaded gensim model object
>>> t.augment('The stories are good')
The films are good

Advanced example

>>> runs = 1 # number of times to augment a sentence. Default is 1.
>>> v = False # verbose mode: replace every word in the sentence. When enabled, runs has no effect. Used in this paper (https://www.cs.cmu.edu/~diyiy/docs/emnlp_wang_2015.pdf)
>>> p = 0.5 # probability of success of an individual trial (0.1<p<1.0). Default is 0.5. Used by the geometric distribution to select words from a sentence.

>>> t = Word2vec(model='path/to/gensim/model', runs=5, v=False, p=0.5)  # model may also be a loaded gensim model object
>>> t.augment('The stories are good')
The movies are excellent
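
The role of p can be illustrated without the library. The sketch below is an assumption about how a geometric distribution could drive word selection, not TextAugment's actual code: it draws the number of words to replace from Geometric(p) and then picks that many positions at random. The names sample_geometric and choose_positions are hypothetical.

```python
import math
import random


def sample_geometric(p, rng):
    """Draw k >= 1 from a geometric distribution with success probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if p == 1.0:
        return 1
    u = rng.random()  # uniform in [0, 1)
    # Inverse-transform sampling: smallest k with 1 - (1 - p)**k >= u
    return max(1, math.ceil(math.log(1.0 - u) / math.log(1.0 - p)))


def choose_positions(words, p=0.5, rng=None):
    """Pick word positions to replace (hypothetical helper).

    The number of positions follows Geometric(p), capped at len(words).
    """
    rng = rng or random.Random()
    count = min(sample_geometric(p, rng), len(words))
    return sorted(rng.sample(range(len(words)), count))


words = "The stories are good".split()
print(choose_positions(words, p=0.5))
```

Under this sketch the expected count before capping is 1/p, so a smaller p selects more words on average.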

WordNet-based augmentation

Basic example

>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('wordnet')
>>> from textaugment import Wordnet
>>> t = Wordnet()
>>> t.augment('In the afternoon, John is going to town')
In the afternoon, John is walking to town

Advanced example

>>> v = True # enable verb augmentation. Default is True.
>>> n = False # enable noun augmentation. Default is False.
>>> runs = 1 # number of times to augment a sentence. Default is 1.
>>> p = 0.5 # probability of success of an individual trial (0.1<p<1.0). Default is 0.5. Used by the geometric distribution to select words from a sentence.

>>> t = Wordnet(v=False, n=True, p=0.5)
>>> t.augment('In the afternoon, John is going to town')
In the afternoon, Joseph is going to town.

RTT-based augmentation

Example

>>> from textaugment import Translate
>>> src = "en" # source language of the sentence
>>> to = "fr" # target language
>>> t = Translate(src=src, to=to)
>>> t.augment('In the afternoon, John is going to town')
In the afternoon John goes to town
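
RTT stands for round-trip translation: the sentence is translated into a pivot language and then back into the source language, and the paraphrase that comes back is the augmented sample. The sketch below shows only that control flow. The round_trip helper and the two toy dictionaries are hypothetical stand-ins for a real translation service, chosen so the example runs offline.

```python
def round_trip(sentence, forward, backward):
    """Translate to a pivot language and back; the result is the augmented sentence."""
    return backward(forward(sentence))


# Toy stand-ins for a real translation service (hypothetical, for illustration only).
EN_TO_FR = {"John is going to town": "John va en ville"}
FR_TO_EN = {"John va en ville": "John goes to town"}

augmented = round_trip(
    "John is going to town",
    forward=lambda s: EN_TO_FR.get(s, s),   # source -> pivot
    backward=lambda s: FR_TO_EN.get(s, s),  # pivot -> source
)
print(augmented)  # John goes to town
```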

EDA: Easy data augmentation techniques for boosting performance on text classification tasks

This is the implementation of EDA by Jason Wei and Kai Zou.

https://www.aclweb.org/anthology/D19-1670.pdf

See this notebook for an example

Synonym Replacement

Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.synonym_replacement("John is going to town")
John is give out to town
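
To make the procedure concrete, here is a minimal standalone sketch of synonym replacement. It is not the library's implementation: the stop-word set and synonym table are toy assumptions, where a real setup would draw synonyms from WordNet.

```python
import random

# Toy resources (assumptions for illustration; a real setup would use WordNet).
STOP_WORDS = {"is", "to", "the", "a", "in"}
SYNONYMS = {"going": ["leaving", "heading"], "town": ["city", "township"]}


def synonym_replacement(sentence, n=1, rng=None):
    """Replace up to n non-stop words that have synonyms with a random synonym."""
    rng = rng or random.Random()
    words = sentence.split()
    # Only words that are not stop words and have a known synonym are eligible.
    candidates = [
        i for i, w in enumerate(words)
        if w.lower() not in STOP_WORDS and w.lower() in SYNONYMS
    ]
    for i in rng.sample(candidates, min(n, len(candidates))):
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)


print(synonym_replacement("John is going to town", n=1))
```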

Random Deletion

Randomly remove each word in the sentence with probability p.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_deletion("John is going to town", p=0.2)
is going to town
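
The rule above fits in a few lines of plain Python. This is a sketch of the technique rather than the library's code. Keeping at least one word when every draw falls below p is an assumption made here so the result is never empty.

```python
import random


def random_deletion(sentence, p=0.2, rng=None):
    """Drop each word independently with probability p; always keep at least one word."""
    rng = rng or random.Random()
    words = sentence.split()
    if len(words) <= 1:
        return sentence
    kept = [w for w in words if rng.random() >= p]
    if not kept:  # everything was deleted, fall back to one random word
        kept = [rng.choice(words)]
    return " ".join(kept)


print(random_deletion("John is going to town", p=0.2))
```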

Random Swap

Randomly choose two words in the sentence and swap their positions. Do this n times.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_swap("John is going to town")
John town going to is
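
A standalone sketch of the same idea, independent of the library: pick two distinct positions, exchange the words, and repeat n times. The default n=1 is an assumption for this example.

```python
import random


def random_swap(sentence, n=1, rng=None):
    """Swap two randomly chosen word positions, repeated n times."""
    rng = rng or random.Random()
    words = sentence.split()
    if len(words) < 2:
        return sentence
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)  # two distinct positions
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


print(random_swap("John is going to town", n=1))
```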

Random Insertion

Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_insertion("John is going to town")
John is going to make up town
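
The steps above can be sketched as follows. This is an illustration rather than the library's code: the stop-word set and synonym table are toy assumptions standing in for WordNet.

```python
import random

# Toy resources (assumptions for illustration; a real setup would use WordNet).
STOP_WORDS = {"is", "to", "the", "a", "in"}
SYNONYMS = {"going": ["leaving", "heading"], "town": ["city", "township"]}


def random_insertion(sentence, n=1, rng=None):
    """Insert a synonym of a random non-stop word at a random position, n times."""
    rng = rng or random.Random()
    words = sentence.split()
    for _ in range(n):
        candidates = [
            w for w in words
            if w.lower() not in STOP_WORDS and w.lower() in SYNONYMS
        ]
        if not candidates:  # nothing to draw a synonym from
            break
        synonym = rng.choice(SYNONYMS[rng.choice(candidates).lower()])
        # randint is inclusive, so the synonym may also be appended at the end.
        words.insert(rng.randint(0, len(words)), synonym)
    return " ".join(words)


print(random_insertion("John is going to town", n=1))
```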

Mixup augmentation

This is the implementation of mixup augmentation by Hongyi Zhang, Moustapha Cisse, Yann Dauphin, David Lopez-Paz adapted to NLP.

Used in Augmenting Data with Mixup for Sentence Classification: An Empirical Study.

Mixup is a generic and straightforward data augmentation principle. In essence, mixup trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularises the neural network to favour simple linear behaviour in-between training examples.
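
In formula form, a mixed example is x = λ·x_i + (1 − λ)·x_j with label y = λ·y_i + (1 − λ)·y_j, where λ is drawn from Beta(α, α). The sketch below applies this to plain Python lists. It is a minimal illustration, not the library's implementation. The mixup function name and the use of lists are assumptions, and for text the inputs would typically be embedding vectors rather than raw tokens.

```python
import random


def mixup(x1, y1, x2, y2, alpha=1.0, rng=None):
    """Return a convex combination of two examples and their one-hot labels."""
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


# Two toy "embeddings" with one-hot labels for a two-class problem.
x, y, lam = mixup([0.2, 0.8], [1.0, 0.0], [0.6, 0.4], [0.0, 1.0], alpha=1.0)
print(x, y, lam)
```

The mixed label stays a valid distribution because the two weights sum to 1.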

Implementation

See this notebook for an example

Authors

Acknowledgements

Cite this paper when using this library. Arxiv Version

@inproceedings{marivate2020improving,
  title={Improving short text classification through global augmentation methods},
  author={Marivate, Vukosi and Sefara, Tshephisho},
  booktitle={International Cross-Domain Conference for Machine Learning and Knowledge Extraction},
  pages={385--399},
  year={2020},
  organization={Springer}
}

Licence

MIT licensed. See the bundled LICENCE file for more details.
