Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

Created with love in Canada, visit hostnodejs.com today

Feel like to post an Ad? Learn Details

All Projects → tofunlp → Lineflow

tofunlp / Lineflow

Licence: mit

⚡️A Lightweight NLP Data Loader for All Deep Learning Frameworks in Python

Programming Languages

139335 projects - #7 most used programming language

Labels

deep-learning machine-learning natural-language-processing

Projects that are alternatives of or similar to Lineflow

SLING - A natural language frame semantics parser

Stars: ✭ 1,892 (+1026.19%)

Mutual labels: natural-language-processing

MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification

Stars: ✭ 159 (-5.36%)

Mutual labels: natural-language-processing

Paper list of NLP for recommender systems

Stars: ✭ 162 (-3.57%)

Mutual labels: natural-language-processing

📖 A curated list of resources dedicated to Natural Language Processing (NLP)

Stars: ✭ 12,626 (+7415.48%)

Mutual labels: natural-language-processing

《机器翻译：基础与模型》肖桐朱靖波著 - Machine Translation: Foundations and Models

Stars: ✭ 2,307 (+1273.21%)

Mutual labels: natural-language-processing

Ngx Dynamic Dashboard Framework

This is a JSON driven angular x based dashboard framework that is inspired by JIRA's dashboard implementation and https://github.com/raulgomis/angular-dashboard-framework

Stars: ✭ 160 (-4.76%)

Mutual labels: natural-language-processing

📅🇨🇳 中国法定节假日数据自动每日抓取国务院公告

Stars: ✭ 157 (-6.55%)

Mutual labels: natural-language-processing

Turkish Stemmer Python

🐍 Turkish Language Stemmer for Python

Stars: ✭ 165 (-1.79%)

Mutual labels: natural-language-processing

Basic Utilities for PyTorch Natural Language Processing (NLP)

Stars: ✭ 1,996 (+1088.1%)

Mutual labels: natural-language-processing

Library to scrape and clean web pages to create massive datasets.

Stars: ✭ 1,985 (+1081.55%)

Mutual labels: natural-language-processing

Topic Modelling for Humans

Stars: ✭ 12,763 (+7497.02%)

Mutual labels: natural-language-processing

Mishkal is an arabic text vocalization software

Stars: ✭ 158 (-5.95%)

Mutual labels: natural-language-processing

Covid Papers Browser

Browse Covid-19 & SARS-CoV-2 Scientific Papers with Transformers 🦠 📖

Stars: ✭ 161 (-4.17%)

Mutual labels: natural-language-processing

Awesome Pytorch List

A comprehensive list of pytorch related content on github,such as different models,implementations,helper libraries,tutorials etc.

Stars: ✭ 12,475 (+7325.6%)

Mutual labels: natural-language-processing

Newsrecommender

A news recommendation system tailored for user communities

Stars: ✭ 164 (-2.38%)

Mutual labels: natural-language-processing

PyTorch code for Learning Cooperative Visual Dialog Agents using Deep Reinforcement Learning

Stars: ✭ 157 (-6.55%)

Mutual labels: natural-language-processing

Nlp bahasa resources

A Curated List of Dataset and Usable Library Resources for NLP in Bahasa Indonesia

Stars: ✭ 158 (-5.95%)

Mutual labels: natural-language-processing

Question generation

It is a question-generator model. It takes text and an answer as input and outputs a question.

Stars: ✭ 166 (-1.19%)

Mutual labels: natural-language-processing

Amacımız Türkçe NLP literatüründeki birçok farklı sorunu bir arada çözebilen, eşsiz yaklaşımlar öne süren ve literatürdeki çalışmaların eksiklerini gideren open source bir yazım destekleyicisi/denetleyicisi oluşturmak. Kullanıcıların yazdıkları metinlerdeki yazım yanlışlarını derin öğrenme yaklaşımıyla çözüp aynı zamanda metinlerde anlamsal analizi de gerçekleştirerek bu bağlamda ortaya çıkan yanlışları da fark edip düzeltebilmek.

Stars: ✭ 165 (-1.79%)

Mutual labels: natural-language-processing

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit

Stars: ✭ 160 (-4.76%)

Mutual labels: natural-language-processing

View All Similar Projects ➔

LineFlow: Framework-Agnostic NLP Data Loader in Python

LineFlow is a simple text dataset loader for NLP deep learning tasks.

LineFlow was designed to use in all deep learning frameworks.
LineFlow enables you to build pipelines via functional APIs (.map, .filter, .flat_map).
LineFlow provides common NLP datasets.

LineFlow is heavily inspired by tensorflow.data.Dataset and chainer.dataset.

Basic Usage

lineflow.TextDataset expects line-oriented text files:

import lineflow as lf


'''/path/to/text will be expected as follows:
i 'm a line 1 .
i 'm a line 2 .
i 'm a line 3 .
'''
ds = lf.TextDataset('/path/to/text')

ds.first()  # "i 'm a line 1 ."
ds.all() # ["i 'm a line 1 .", "i 'm a line 2 .", "i 'm a line 3 ."]
len(ds)  # 3
ds.map(lambda x: x.split()).first()  # ["i", "'m", "a", "line", "1", "."]

Example

Please check out the examples to see how to use LineFlow, especially for tokenization, building vocabulary, and indexing.

Loads Penn Treebank:

>>> import lineflow.datasets as lfds
>>> train = lfds.PennTreebank('train')
>>> train.first()
' aer banknote berlitz calloway centrust cluett fromstein gitano guterman hydro-quebec ipo kia memotec mlx nahb punts rake regatta rubens sim snack-food ssangyong swapo wachter '

Splits the sentence to the words:

>>> # continuing from above
>>> train = train.map(str.split)
>>> train.first()
['aer', 'banknote', 'berlitz', 'calloway', 'centrust', 'cluett', 'fromstein', 'gitano', 'guterman', 'hydro-quebec', 'ipo', 'kia', 'memotec', 'mlx', 'nahb', 'punts', 'rake', 'regatta', 'rubens', 'sim', 'snack-food', 'ssangyong', 'swapo', 'wachter']

Obtains words in dataset:

>>> # continuing from above
>>> words = train.flat_map(lambda x: x)
>>> words.take(5) # This is useful to build vocabulary.
['aer', 'banknote', 'berlitz', 'calloway', 'centrust']

Further more:

How to fine-tune BERT with pytorch-lightning by @sobamchan

Requirements

Python3.6+

Installation

To install LineFlow:

pip install lineflow

Datasets

Is the dataset you want to use not supported? Suggest a new dataset 🎉

Commonsense Reasoning
Language Modeling
Machine Translation
Paraphrase
Question Answering
Sentiment Analysis
Sequence Tagging
Text Summarization

Commonsense Reasoning

CommonsenseQA

Loads the CommonsenseQA dataset:

>>> import lineflow.datasets as lfds

>>> train = lfds.CommonsenseQA("train")
>>> dev = lfds.CommonsenseQA("dev")
>>> test = lfds.CommonsenseQA("test")

The items in this datset as follows:

>>> import lineflow.datasets as lfds

>>> train = lfds.CommonsenseQA("train")
>>> train.first()
{"id": "075e483d21c29a511267ef62bedc0461",
 "answer_key": "A",
 "options": {"A": "ignore",
 "B": "enforce",
 "C": "authoritarian",
 "D": "yell at",
 "E": "avoid"},
 "stem": "The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?"}
}

Language Modeling

Penn Treebank

Loads the Penn Treebank dataset:

import lineflow.datasets as lfds

train = lfds.PennTreebank('train')
dev = lfds.PennTreebank('dev')
test = lfds.PennTreebank('test')

WikiText-103

Loads the WikiText-103 dataset:

import lineflow.datasets as lfds

train = lfds.WikiText103('train')
dev = lfds.WikiText103('dev')
test = lfds.WikiText103('test')

This dataset is preprossed, so you can tokenize each line with str.split:

>>> import lineflow.datasets as lfds
>>> train = lfds.WikiText103('train').flat_map(lambda x: x.split() + ['<eos>'])
>>> train.take(5)
['<eos>', '=', 'Valkyria', 'Chronicles', 'III']

WikiText-2 (Added by @sobamchan, thanks.)

Loads the WikiText-2 dataset:

import lineflow.datasets as lfds

train = lfds.WikiText2('train')
dev = lfds.WikiText2('dev')
test = lfds.WikiText2('test')

This dataset is preprossed, so you can tokenize each line with str.split:

>>> import lineflow.datasets as lfds
>>> train = lfds.WikiText2('train').flat_map(lambda x: x.split() + ['<eos>'])
>>> train.take(5)
['<eos>', '=', 'Valkyria', 'Chronicles', 'III']

Machine Translation

small_parallel_enja:

Loads the small_parallel_enja dataset which is small English-Japanese parallel corpus:

import lineflow.datasets as lfds

train = lfds.SmallParallelEnJa('train')
dev = lfds.SmallParallelEnJa('dev')
test = lfd.SmallParallelEnJa('test')

This dataset is preprossed, so you can tokenize each line with str.split:

>>> import lineflow.datasets as lfds
>>> train = lfds.SmallParallelEnJa('train').map(lambda x: (x[0].split(), x[1].split()))
>>> train.first()
(['i', 'can', "'t", 'tell', 'who', 'will', 'arrive', 'first', '.'], ['誰', 'が', '一番', 'に', '着', 'く', 'か', '私', 'に', 'は', '分か', 'り', 'ま', 'せ', 'ん', '。']

WMT 14

Loads the WMT14 dataset:

import lineflow.datasets as lfds

train = lfds.Wmt14('train')
dev = lfds.Wmt14('dev')
test = lfd.Wmt14('test')

Paraphrase

Microsoft Research Paraphrase Corpus:

Loads the Miscrosoft Research Paraphrase Corpus:

import lineflow.datasets as lfds

train = lfds.MsrParaphrase('train')
test = lfds.MsrParaphrase('test')

The item in this dataset as follows:

>>> import lineflow.datasets as lfds
>>> train = lfds.MsrParaphrase('train')
>>> train.first()
{'quality': '1',
 'id1': '702876',
 'id2': '702977',
 'string1': 'Amrozi accused his brother, whom he called "the witness", of deliberately distorting his evidence.',
 'string2': 'Referring to him as only "the witness", Amrozi accused his brother of deliberately distorting his evidence.'
}

Question Answering

Loads the SQuAD dataset:

import lineflow.datasets as lfds

train = lfds.Squad('train')
dev = lfds.Squad('dev')

The item in this dataset as follows:

>>> import lineflow.datasets as lfds
>>> train = lfds.Squad('train')
>>> train.first()
{'answers': [{'answer_start': 515, 'text': 'Saint Bernadette Soubirous'}],
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'}

Sentiment Analysis

IMDB:

Loads the IMDB dataset:

import lineflow.datasets as lfds

train = lfds.Imdb('train')
test = lfds.Imdb('test')

The item in this dataset as follows:

>>> import lineflow.datasets as lfds
>>> train = lfds.Imdb('train')
>>> train.first()
('For a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. Imagine a movie where Joe Piscopo is actually funny! Maureen Stapleton is a scene stealer. The Moroni character is an absolute scream. Watch for Alan "The Skipper" Hale jr. as a police Sgt.', 0)

Sequence Tagging

CoNLL2000

Loads the CoNLL2000 dataset:

import lineflow.datasets as lfds

train = lfds.Conll2000('train')
test = lfds.Conll2000('test')

Text Summarization

CNN / Daily Mail:

Loads the CNN / Daily Mail dataset:

import lineflow.datasets as lfds

train = lfds.CnnDailymail('train')
dev = lfds.CnnDailymail('dev')
test = lfds.CnnDailymail('test')

This dataset is preprossed, so you can tokenize each line with str.split:

>>> import lineflow.datasets as lfds
>>> train = lfds.CnnDailymail('train').map(lambda x: (x[0].split(), x[1].split()))
>>> train.first()
... # the output is omitted because it's too long to display here.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].

Stars: ✭ 168

Visit Git Page 🔗Visit User Page 🔗Visit Issues Page (2) 🔗