
airbnb / Artificial Adversary

License: MIT
🗣️ Tool to generate adversarial text examples and test machine learning models against them

Programming Languages

python
139335 projects - #7 most used programming language
python3
1442 projects
python2
120 projects

Projects that are alternatives of or similar to Artificial Adversary

support-tickets-classification
This case study shows how to create a model for text analysis and classification and deploy it as a web service in Azure cloud in order to automatically classify support tickets. This project is a proof of concept made by Microsoft (Commercial Software Engineering team) in collaboration with Endava http://endava.com/en
Stars: ✭ 142 (-59.2%)
Mutual labels:  text-mining, text-classification, text-analysis, classification, text-processing
Text mining resources
Resources for learning about Text Mining and Natural Language Processing
Stars: ✭ 358 (+2.87%)
Mutual labels:  data-mining, text-classification, text-mining, text-analysis
corpusexplorer2.0
Corpus linguistics has never been this easy...
Stars: ✭ 16 (-95.4%)
Mutual labels:  text-mining, data-mining, text-analysis, text-processing
Awesome Text Classification
Awesome-Text-Classification: projects, papers, and tutorials.
Stars: ✭ 158 (-54.6%)
Mutual labels:  classification, text-classification, text-mining, text-analysis
Rmdl
RMDL: Random Multimodel Deep Learning for Classification
Stars: ✭ 375 (+7.76%)
Mutual labels:  classification, data-mining, text-classification, text-mining
Applied Text Mining In Python
Repo for Applied Text Mining in Python (coursera) by University of Michigan
Stars: ✭ 59 (-83.05%)
Mutual labels:  classification, text-classification, text-mining, text-processing
Fake news detection
Fake News Detection in Python
Stars: ✭ 194 (-44.25%)
Mutual labels:  classification, text-classification, text-mining, text-analysis
kwx
BERT, LDA, and TFIDF based keyword extraction in Python
Stars: ✭ 33 (-90.52%)
Mutual labels:  text-mining, text-classification, text-analysis
Data Science Toolkit
Collection of stats, modeling, and data science tools in Python and R.
Stars: ✭ 169 (-51.44%)
Mutual labels:  data-science, classification, data-mining
DaDengAndHisPython
[WeChat official account: DaDeng and his Python] — quick introduction to Python syntax: https://www.bilibili.com/video/av44384851; quick introduction to Python web scraping: https://www.bilibili.com/video/av72010301; contact email: [email protected]
Stars: ✭ 59 (-83.05%)
Mutual labels:  text-mining, text-classification, text-analysis
text-analysis
Weaving analytical stories from text data
Stars: ✭ 12 (-96.55%)
Mutual labels:  text-mining, text-analysis, text-processing
R Text Data
List of textual data sources to be used for text mining in R
Stars: ✭ 85 (-75.57%)
Mutual labels:  data-science, text-mining, text-analysis
Php Ml
PHP-ML - Machine Learning library for PHP
Stars: ✭ 7,900 (+2170.11%)
Mutual labels:  data-science, classification, data-mining
Automlpipeline.jl
A package that makes it trivial to create and evaluate machine learning pipeline architectures.
Stars: ✭ 223 (-35.92%)
Mutual labels:  data-science, classification, data-mining
perke
A keyphrase extractor for Persian
Stars: ✭ 60 (-82.76%)
Mutual labels:  text-mining, data-mining, text-processing
Gwu data mining
Materials for GWU DNSC 6279 and DNSC 6290.
Stars: ✭ 217 (-37.64%)
Mutual labels:  data-science, data-mining, text-mining
Pycm
Multi-class confusion matrix library in Python
Stars: ✭ 1,076 (+209.2%)
Mutual labels:  data-science, classification, data-mining
Orange3
🍊 📊 💡 Orange: Interactive data analysis
Stars: ✭ 3,152 (+805.75%)
Mutual labels:  data-science, classification, data-mining
Text-Classification-LSTMs-PyTorch
The aim of this repository is to show a baseline model for text classification by implementing an LSTM-based model in PyTorch. To provide a better understanding of the model, a Tweets dataset provided by Kaggle is used.
Stars: ✭ 45 (-87.07%)
Mutual labels:  text-mining, text-classification, text-processing
teanaps
An open-source Python library for natural language processing and text analysis.
Stars: ✭ 91 (-73.85%)
Mutual labels:  text-mining, data-mining, text-processing


This repo is primarily maintained by Devin Soni and Philbert Lin.

Note that this project is under active development. If you encounter bugs, please report them in the issues tab.

Introduction

When classifying user-generated text, there are many ways that users can modify their content to avoid detection. These methods are typically cosmetic modifications that change the raw characters or words used, but leave the original meaning visible enough for human readers to understand. Such methods include replacing characters with similar-looking ones, removing or adding punctuation and spacing, and swapping letters within words. For example, "please wire me 10,000 US DOLLARS to bank of scamland" is probably an obvious scam message, but "[email protected] me 10000 US DoLars to,BANK of ScamIand" would fool many classifiers.

This library allows you to generate texts using these methods and simulate these kinds of attacks on your machine learning models. By exposing your model to these texts offline, you will be better prepared for them when you encounter them in an online setting. Compared to other libraries, this one differs in that it treats the model as a black box and uses only generic attacks that do not depend on knowledge of the model itself.

Installation

pip install Adversary
python -m textblob.download_corpora

Usage

See Example.ipynb for a quick illustrative example.

from Adversary import Adversary
gen = Adversary(verbose=True, output='Output/')
texts_original = ['tell me awful things']  # supply positive-class texts only (see the note below)
texts_generated = gen.generate(texts_original)  # attacked variants of the originals
metrics_single, metrics_group = gen.attack(texts_original, texts_generated, lambda x: 1)  # the constant lambda stands in for a real model's predict function

Use cases:

1) For data-set augmentation: To prepare for these attacks in the wild, an obvious method is to train on examples that are close in nature to the expected attacks. Training on adversarial examples has become a standard technique and has been shown to produce more robust classifiers. Using the texts generated by this library will allow you to build resilient models that can handle obfuscation of input text (see the sketch after this list).

2) For performance bounds: If you do not want to alter an existing model, this library will allow you to obtain performance expectations under each possible type of attack.
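
A minimal augmentation sketch for use case 1, assuming you track labels yourself; the variable names are illustrative, and the tuple layout is the documented output of generate:

from Adversary import Adversary

gen = Adversary()
spam_texts = ['please wire me 10,000 US DOLLARS to bank of scamland']  # positive-class examples only
generated = gen.generate(spam_texts)

# Each generated tuple is (attacked text, list of attacks, index of original text);
# append the attacked variants to the training set under the same positive label.
augmented_texts = spam_texts + [attacked for attacked, attacks, idx in generated]
augmented_labels = [1] * len(augmented_texts)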

Included attacks:

  • Text-level:
    • Adding generic words to mask the suspicious parts (good_word_attack)
    • Swapping words (swap_words)
    • Removing spacing between words (remove_spacing)
  • Word-level:
    • Replacing words with synonyms (synonym)
    • Replacing letters with similar-looking symbols (letter_to_symbol)
    • Swapping letters (swap_letters)
    • Inserting punctuation (insert_punctuation)
    • Inserting duplicate characters (insert_duplicate_characters)
    • Deleting characters (delete_characters)
    • Changing case (change_case)
    • Replacing digits with words (num_to_word)
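
The names in parentheses above are the identifiers accepted by the attacks argument of generate (documented below). A hedged sketch of restricting generation to a few attacks, reusing gen and texts_original from the usage example; the probabilities are purely illustrative:

texts_generated = gen.generate(
    texts_original,
    attacks={'synonym': 0.4, 'swap_letters': 0.3, 'insert_punctuation': 0.3},  # dict of attack name to probability
    max_attacks=2,  # at most two attacks applied to any single text
)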

Interface:

Constructor

Adversary(
    verbose=False, 
    output=None
)
  • verbose: If True, prints progress output while generating texts and while conducting attacks
  • output: If provided, pickles generated texts and metrics DataFrames to a folder at the given output path

Returns: None


Note: only provide instances of the positive class to the functions below.

Generate attacked texts

Adversary.generate(
    texts,
    text_sample_rate=1.0,
    word_sample_rate=0.3,
    attacks='all',
    max_attacks=2,
    random_seed=None,
    save=False
)
  • texts: List of original strings
  • text_sample_rate: P(individual text is attacked) if in [0, 1]; otherwise, the number of copies of each text to use
  • word_sample_rate: P(word_i is sampled in a given word attack | word's text is sampled)
  • attacks: Description of attack configuration - either 'all', list of str corresponding to attack names, or dict of attack name to probability
  • max_attacks: Maximum number of attacks that can be applied to a single text
  • random_seed: Seed for calls to random module functions
  • save: Whether the generated texts should be pickled as output

Returns: List of tuples of generated strings in format (attacked text, list of attacks, index of original text).

Due to the probabilistic sampling and length heuristics used in certain attacks, some of the generated texts may not differ from the original.
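
A hedged example of the full signature in use, again reusing gen and texts_original from the usage example; text_sample_rate=2.0 is read, per the parameter description above, as two attacked copies of each input text, and random_seed makes the sampling reproducible:

texts_generated = gen.generate(
    texts_original,
    text_sample_rate=2.0,   # > 1, so interpreted as the number of copies per text
    word_sample_rate=0.3,   # chance a given word is hit within a sampled word-level attack
    attacks=['letter_to_symbol', 'num_to_word'],
    max_attacks=2,
    random_seed=42,
    save=False,
)

# Unpack the documented return format.
for attacked_text, attacks_applied, original_index in texts_generated:
    print(original_index, attacks_applied, attacked_text)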


Simulate attack on texts

Adversary.attack(
    texts_original, 
    texts_generated, 
    predict_function, 
    save=False
)
  • texts_original: List of original texts
  • texts_generated: List of generated texts (output of generate function)
  • predict_function: Function that maps str input text to int classification label (0 or 1) - this probably wraps a machine learning model's predict function
  • save: Whether the generated metrics DataFrames should be pickled as output

Returns: Tuple of two DataFrames containing performance metrics (single attacks, and grouped attacks, respectively)
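
A hedged sketch of wrapping a real model in predict_function: the scikit-learn pipeline below is purely illustrative and not part of this library; any function that maps a string to 0 or 1 will do.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative toy classifier; in practice, plug in your own trained model.
train_texts = ['please wire me money', 'send your bank details', 'see you at lunch', 'meeting moved to 3pm']
train_labels = [1, 1, 0, 0]
model = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(train_texts, train_labels)

def predict_function(text):
    # attack expects a function mapping a raw string to an int label (0 or 1)
    return int(model.predict([text])[0])

metrics_single, metrics_group = gen.attack(texts_original, texts_generated, predict_function, save=True)

As documented above, metrics_single summarizes performance under each individual attack and metrics_group under combinations of attacks, which is what the performance-bounds use case relies on.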

Contributing

Check the issues tab on GitHub for outstanding issues. Otherwise, feel free to add new attacks in attacks.py or other features in a pull request, and the maintainers will review them. Please make sure you pass the CI checks and add tests where applicable.

Acknowledgments

Credits to Airbnb for giving me the freedom to create this tool during my internship, and Jack Dai for the (obvious in hindsight) name for the project.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].