chakki-works / Chazutsu

Licence: apache-2.0
The tool to make NLP datasets ready to use

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to Chazutsu

Mtnt
Code for the collection and analysis of the MTNT dataset
Stars: ✭ 48 (-79.83%)
Mutual labels:  dataset, natural-language-processing
Bond
BOND: BERT-Assisted Open-Domain Named Entity Recognition with Distant Supervision
Stars: ✭ 96 (-59.66%)
Mutual labels:  dataset, natural-language-processing
Coarij
Corpus of Annual Reports in Japan
Stars: ✭ 55 (-76.89%)
Mutual labels:  dataset, natural-language-processing
Hate Speech And Offensive Language
Repository for the paper "Automated Hate Speech Detection and the Problem of Offensive Language", ICWSM 2017
Stars: ✭ 543 (+128.15%)
Mutual labels:  dataset, natural-language-processing
Prosody
Helsinki Prosody Corpus and A System for Predicting Prosodic Prominence from Text
Stars: ✭ 139 (-41.6%)
Mutual labels:  dataset, natural-language-processing
Insuranceqa Corpus Zh
🚁 Insurance-industry corpus and chatbot (in Chinese)
Stars: ✭ 821 (+244.96%)
Mutual labels:  dataset, natural-language-processing
Pytreebank
😡😇 Stanford Sentiment Treebank loader in Python
Stars: ✭ 93 (-60.92%)
Mutual labels:  dataset, natural-language-processing
Fakenewscorpus
A dataset of millions of news articles scraped from a curated list of data sources.
Stars: ✭ 255 (+7.14%)
Mutual labels:  dataset, natural-language-processing
Mams For Absa
A Multi-Aspect Multi-Sentiment Dataset for aspect-based sentiment analysis.
Stars: ✭ 135 (-43.28%)
Mutual labels:  dataset, natural-language-processing
Awesome Hungarian Nlp
A curated list of NLP resources for Hungarian
Stars: ✭ 121 (-49.16%)
Mutual labels:  dataset, natural-language-processing
Doccano
Open source annotation tool for machine learning practitioners.
Stars: ✭ 5,600 (+2252.94%)
Mutual labels:  dataset, natural-language-processing
Nlp bahasa resources
A Curated List of Dataset and Usable Library Resources for NLP in Bahasa Indonesia
Stars: ✭ 158 (-33.61%)
Mutual labels:  dataset, natural-language-processing
Text2sql Data
A collection of datasets that pair questions with SQL queries.
Stars: ✭ 287 (+20.59%)
Mutual labels:  dataset, natural-language-processing
Wikisql
A large annotated semantic parsing corpus for developing natural language interfaces.
Stars: ✭ 965 (+305.46%)
Mutual labels:  dataset, natural-language-processing
Oie Resources
A curated list of Open Information Extraction (OIE) resources: papers, code, data, etc.
Stars: ✭ 283 (+18.91%)
Mutual labels:  dataset, natural-language-processing
Char Rnn Tensorflow
Multi-layer Recurrent Neural Networks for character-level language models, implemented in TensorFlow
Stars: ✭ 58 (-75.63%)
Mutual labels:  dataset, natural-language-processing
Ua Gec
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
Stars: ✭ 108 (-54.62%)
Mutual labels:  dataset, natural-language-processing
Pytorch Nlp
Basic Utilities for PyTorch Natural Language Processing (NLP)
Stars: ✭ 1,996 (+738.66%)
Mutual labels:  dataset, natural-language-processing
Korean Hate Speech
Korean HateSpeech Dataset
Stars: ✭ 192 (-19.33%)
Mutual labels:  dataset, natural-language-processing
Stocknet Dataset
A comprehensive dataset for stock movement prediction from tweets and historical stock prices.
Stars: ✭ 228 (-4.2%)
Mutual labels:  dataset

chazutsu

chazutsu_top.PNG
photo from Kaikado, a traditional Japanese chazutsu maker

chazutsu is a dataset downloader for NLP.

>>> import chazutsu
>>> r = chazutsu.datasets.IMDB().download()
>>> r.train_data().head(5)

Then

   polarity  rating                                             review
0         0       3  You'd think the first landing on the Moon woul...
1         1       9  I took a flyer in renting this movie but I got...
2         1      10  Sometimes I just want to laugh. Don't you? No ...
3         0       2  I knew it wasn't gunna work out between me and...
4         0       2  Sometimes I rest my head and think about the r...

You can use chazutsu on Jupyter.

Install

pip install chazutsu

Supported datasets

chazutsu supports various kinds of datasets!
Please see the details here! (A quick download sketch follows the list below.)

  • Sentiment Analysis
    • Movie Review Data
    • Customer Review Datasets
    • Large Movie Review Dataset(IMDB)
  • Text classification
    • 20 Newsgroups
    • Reuters News Corpus (RCV1-v2)
  • Language Modeling
    • Penn Tree Bank
    • WikiText-2
    • WikiText-103
    • text8
  • Text Summarization
    • DUC2003
    • DUC2004
    • Gigaword
  • Textual entailment
    • The Multi-Genre Natural Language Inference (MultiNLI)
  • Question Answering
    • The Stanford Question Answering Dataset (SQuAD)
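
Every dataset in this list is exposed as a class under chazutsu.datasets and follows the same pattern as the IMDB example above. A minimal sketch (the WikiText2 class name is an assumption inferred from this list; check the linked details page for the exact names):

>>> import chazutsu
>>> r = chazutsu.datasets.WikiText2().download()  # assumed class name
>>> r.train_data().head(3)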

How it works

chazutsu not only downloads the dataset, but also expands the archive, shuffles, splits, and samples it (each of these steps can be disabled by an argument if you don't need it).

chazutsu_process1.png

r = chazutsu.datasets.MovieReview.polarity(shuffle=False, test_size=0.3, sample_count=100).download()
  • shuffle: Whether to shuffle the dataset (True/False).
  • test_size: The ratio of the test split (ignored if the dataset already provides separate train and test sets).
  • sample_count: Pick only this many samples from the dataset, to avoid editor freezes caused by huge text files.
  • force: Don't use the cache; re-download the dataset (see the sketch below).
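
For example, to bypass the cache and fetch a fresh copy (whether force belongs in this call, alongside the other arguments, is an assumption based on the example above):

r = chazutsu.datasets.MovieReview.polarity(force=True).download()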

chazutsu also supports fundamental preprocessing for tokenization.

chazutsu_process2.png

>>> import chazutsu
>>> r = chazutsu.datasets.MovieReview.subjectivity().download()
>>> r.train_data().head(3)

Then

    subjectivity                                             review
0             0  . . . works on some levels and is certainly wo...
1             1  the hulk is an anger fueled monster with incre...
2             1  when the skittish emma finds blood on her pill...

Now we want to convert this data into a form we can use to train models in various frameworks.

fixed_len = 10
r.make_vocab(vocab_size=1000)
r.column("review").as_word_seq(fixed_len=fixed_len)
X, y = r.to_batch("train")
assert X.shape == (len(y), fixed_len, len(r.vocab))
assert y.shape == (len(y), 1)
  • make_vocab (a fuller call is sketched below)
    • vocab_resources: Resources used to build the vocabulary ("train", "valid", "test")
    • columns_for_vocab: The columns used to build the vocabulary
    • tokenizer: Tokenizer
    • vocab_size: Vocabulary size
    • min_word_freq: Minimum word count for a word to be included in the vocabulary
    • unknown: The tag used for out-of-vocabulary words
    • padding: The tag used to pad sequences
    • end_of_sentence: If you want to mark the end of a sentence with a specific tag, use this.
    • reserved_words: Words that should always be included in the vocabulary (e.g. the tag for padding)
    • force: Don't use the cache; re-create the vocabulary.
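
A fuller make_vocab call might combine several of these parameters. The following is a minimal sketch; the values and the tag strings are illustrative assumptions, not recommended settings:

r.make_vocab(
    vocab_resources=["train", "valid"],  # splits used to build the vocabulary
    columns_for_vocab=["review"],        # build it from the review column only
    vocab_size=5000,                     # keep at most 5,000 words
    min_word_freq=2,                     # drop words that appear only once
    unknown="<unk>",                     # assumed tag for out-of-vocabulary words
    padding="<pad>",                     # assumed tag used to pad sequences
    force=True)                          # ignore any cached vocabulary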

If you don't want to load all the training data at once, you can use to_batch_iter.
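
Its exact signature isn't shown in this README; assuming it mirrors to_batch and yields mini-batches, usage might look like the following sketch (batch_size and train_step are hypothetical):

# Hypothetical sketch: assumes to_batch_iter takes the same "train" argument
# as to_batch and an assumed batch_size keyword; check the chazutsu docs.
for X_batch, y_batch in r.to_batch_iter("train", batch_size=32):
    train_step(X_batch, y_batch)  # train_step: your own training function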

Additional Feature

Use on Jupyter

You can use chazutsu on Jupyter Notebook.

on_jupyter.png

Before you run chazutsu on Jupyter, you have to enable the widget extension with the command below.

jupyter nbextension enable --py --sys-prefix widgetsnbextension