oscar-corpus / goclassy

License: Apache-2.0

An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.

Projects that are alternatives to or similar to goclassy

ungoliant

🕷️ The pipeline for the OSCAR corpus

Stars: ✭ 69 (-14.81%)

Mutual labels: corpus-linguistics, fasttext, common-crawl, language-classification

nerus

Large silver standard Russian corpus with NER, morphology and syntax markup

Stars: ✭ 47 (-41.98%)

Mutual labels: corpus-linguistics

Pytorch Sentiment Analysis

Tutorials on getting started with PyTorch and TorchText for sentiment analysis.

Stars: ✭ 3,209 (+3861.73%)

Mutual labels: fasttext

fasttext-serving

Serve your fastText models for text classification and word vectors

Stars: ✭ 21 (-74.07%)

Mutual labels: fasttext

kontext

An advanced, extensible web front-end for the Manatee-open corpus search engine

Stars: ✭ 50 (-38.27%)

Mutual labels: corpus-linguistics

Ai law

All kinds of baseline models for long text classification (text categorization)

Stars: ✭ 243 (+200%)

Mutual labels: fasttext

Cw2vec

cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information

Stars: ✭ 224 (+176.54%)

Mutual labels: fasttext

kanji-frequency

Kanji usage frequency data collected from various sources

Stars: ✭ 92 (+13.58%)

Mutual labels: corpus-linguistics

german-sentiment

A data set and model for German sentiment classification.

Stars: ✭ 37 (-54.32%)

Mutual labels: fasttext

actions-suggest-related-links

A GitHub Action to suggest related or similar issues, documents, and links. Based on the power of NLP and fastText.

Stars: ✭ 23 (-71.6%)

Mutual labels: fasttext

fastchess

Predicts the best chess move with 27.5% accuracy by a single matrix multiplication

Stars: ✭ 75 (-7.41%)

Mutual labels: fasttext

fasttextjs

JavaScript implementation of the FastText prediction algorithm

Stars: ✭ 31 (-61.73%)

Mutual labels: fasttext

CogNet

CogNet: a large-scale, high-quality cognate database for 338 languages, 1.07M words, and 8.1 million cognates

Stars: ✭ 26 (-67.9%)

Mutual labels: corpus-linguistics

NLP-paper

🎨🎨 NLP natural language processing tutorial 🎨🎨 https://dataxujing.github.io/NLP-paper/

Stars: ✭ 23 (-71.6%)

Mutual labels: fasttext

Text Classification TF

Implements various text classification models in TensorFlow and wraps them in a RESTful interface, ready for production use

Stars: ✭ 32 (-60.49%)

Mutual labels: fasttext

Pyfasttext

Yet another Python binding for fastText

Stars: ✭ 229 (+182.72%)

Mutual labels: fasttext

corpusexplorer2.0

Corpus linguistics has never been this easy...

Stars: ✭ 16 (-80.25%)

Mutual labels: corpus-linguistics

compress-fasttext

Tools for shrinking fastText models (in gensim format)

Stars: ✭ 124 (+53.09%)

Mutual labels: fasttext

fasttext-serving

fastText model serving service

Stars: ✭ 54 (-33.33%)

Mutual labels: fasttext

fasttext-serverless

Serverless hashtag recommendations using fastText and Python with AWS Lambda

Stars: ✭ 20 (-75.31%)

Mutual labels: fasttext

goclassy

This is the old OSCAR pipeline. If you are looking for the upcoming pipeline, please take a look at the Ungoliant project.

An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.

For more information, see our paper, cited under References.

If you want to download OSCAR you can do it here.

Note: For the moment, the downloader part of the pipeline is not available, as it is still experimental. It will be open-sourced in a future release.
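
To make the idea of a concurrent classification pipeline more concrete, here is a minimal Go sketch of one way such a stage could be laid out: lines are fanned out over a channel to a pool of worker goroutines, each line receives a language label, and the results are collected by language. This is not goclassy's actual code. The names `Classifier`, `stubClassify`, and `ClassifyLines` are hypothetical, and the toy keyword check merely stands in for a real fastText language-identification model.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Classifier maps a line of text to a language label.
// Hypothetical type: a real pipeline would wrap a fastText model here.
type Classifier func(line string) string

// stubClassify is a toy stand-in for a language-identification model,
// used only so that this sketch runs without external dependencies.
func stubClassify(line string) string {
	padded := " " + strings.ToLower(line) + " "
	switch {
	case strings.Contains(padded, " the "):
		return "en"
	case strings.Contains(padded, " le ") || strings.Contains(padded, " la "):
		return "fr"
	default:
		return "unknown"
	}
}

// labelled pairs a line with the language assigned to it.
type labelled struct {
	lang string
	line string
}

// ClassifyLines fans the input lines out to a pool of worker goroutines
// and groups the classified lines by language label.
func ClassifyLines(lines []string, workers int, classify Classifier) map[string][]string {
	if workers < 1 {
		workers = 1
	}
	in := make(chan string)
	out := make(chan labelled)

	// Workers: read lines, classify them, and emit labelled results.
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for line := range in {
				out <- labelled{lang: classify(line), line: line}
			}
		}()
	}

	// Producer: feed the lines, then signal that no more input is coming.
	go func() {
		for _, l := range lines {
			in <- l
		}
		close(in)
	}()

	// Close the output channel once every worker has finished.
	go func() {
		wg.Wait()
		close(out)
	}()

	// Collector: group results by language in the calling goroutine.
	byLang := make(map[string][]string)
	for r := range out {
		byLang[r.lang] = append(byLang[r.lang], r.line)
	}
	return byLang
}

func main() {
	lines := []string{
		"The cat sat on the mat",
		"Le chat est sur la table",
		"12345",
	}
	for lang, ls := range ClassifyLines(lines, 4, stubClassify) {
		fmt.Printf("%s: %d line(s)\n", lang, len(ls))
	}
}
```

Because the workers run concurrently, the order of lines within each language group is not guaranteed. In a setting where I/O is the bottleneck, the producer and collector would typically stream from and to files rather than hold everything in memory as this sketch does.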

References

@inproceedings{OrtizSuarezSagotRomary2019,
  author    = {Pedro Javier {Ortiz Su{\'a}rez} and Beno{\^i}t Sagot and Laurent Romary},
  title     = {Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures},
  series    = {Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019},
  editor    = {Piotr Bański and Adrien Barbaresi and Hanno Biber and Evelyn Breiteneder and Simon Clematide and Marc Kupietz and Harald L{\"u}ngen and Caroline Iliadi},
  publisher = {Leibniz-Institut f{\"u}r Deutsche Sprache},
  address   = {Mannheim},
  doi       = {10.14618/ids-pub-9021},
  url       = {http://nbn-resolving.de/urn:nbn:de:bsz:mh39-90215},
  pages     = {9 -- 16},
  year      = {2019},
  abstract  = {Common Crawl is a considerably large, heterogeneous multilingual corpus comprised of crawled documents from the internet, surpassing 20TB of data and distributed as a set of more than 50 thousand plain text files where each contains many documents written in a wide variety of languages. Even though each document has a metadata block associated to it, this data lacks any information about the language in which each document is written, making it extremely difficult to use Common Crawl for monolingual applications. We propose a general, highly parallel, multithreaded pipeline to clean and classify Common Crawl by language; we specifically design it so that it runs efficiently on medium to low resource infrastructures where I/O speeds are the main constraint. We develop the pipeline so that it can be easily reapplied to any kind of heterogeneous corpus and so that it can be parameterised to a wide range of infrastructures. We also distribute a 6.3TB version of Common Crawl, filtered, classified by language, shuffled at line level in order to avoid copyright issues, and ready to be used for NLP applications.},
  language  = {en}
}
