
oscar-corpus / goclassy

License: Apache-2.0
An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.

Programming Languages

go

Projects that are alternatives of or similar to goclassy

ungoliant
🕷️ The pipeline for the OSCAR corpus
Stars: ✭ 69 (-14.81%)
Mutual labels:  corpus-linguistics, fasttext, common-crawl, language-classification
nerus
Large silver-standard Russian corpus with NER, morphology and syntax markup
Stars: ✭ 47 (-41.98%)
Mutual labels:  corpus-linguistics
Pytorch Sentiment Analysis
Tutorials on getting started with PyTorch and TorchText for sentiment analysis.
Stars: ✭ 3,209 (+3861.73%)
Mutual labels:  fasttext
fasttext-serving
Serve your fastText models for text classification and word vectors
Stars: ✭ 21 (-74.07%)
Mutual labels:  fasttext
kontext
An advanced, extensible web front-end for the Manatee-open corpus search engine
Stars: ✭ 50 (-38.27%)
Mutual labels:  corpus-linguistics
Ai law
All kinds of baseline models for long text classification (text categorization)
Stars: ✭ 243 (+200%)
Mutual labels:  fasttext
Cw2vec
cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information
Stars: ✭ 224 (+176.54%)
Mutual labels:  fasttext
kanji-frequency
Kanji usage frequency data collected from various sources
Stars: ✭ 92 (+13.58%)
Mutual labels:  corpus-linguistics
german-sentiment
A data set and model for german sentiment classification.
Stars: ✭ 37 (-54.32%)
Mutual labels:  fasttext
actions-suggest-related-links
A GitHub Action to suggest related or similar issues, documents, and links. Based on the power of NLP and fastText.
Stars: ✭ 23 (-71.6%)
Mutual labels:  fasttext
fastchess
Predicts the best chess move with 27.5% accuracy by a single matrix multiplication
Stars: ✭ 75 (-7.41%)
Mutual labels:  fasttext
fasttextjs
JavaScript implementation of the FastText prediction algorithm
Stars: ✭ 31 (-61.73%)
Mutual labels:  fasttext
CogNet
CogNet: a large-scale, high-quality cognate database for 338 languages, 1.07M words, and 8.1 million cognates
Stars: ✭ 26 (-67.9%)
Mutual labels:  corpus-linguistics
NLP-paper
🎨 NLP (natural language processing) tutorial 🎨 https://dataxujing.github.io/NLP-paper/
Stars: ✭ 23 (-71.6%)
Mutual labels:  fasttext
Text Classification TF
Implements various text classification models in TensorFlow and wraps them in a RESTful interface, ready for direct use in production
Stars: ✭ 32 (-60.49%)
Mutual labels:  fasttext
Pyfasttext
Yet another Python binding for fastText
Stars: ✭ 229 (+182.72%)
Mutual labels:  fasttext
corpusexplorer2.0
Corpus linguistics has never been so easy...
Stars: ✭ 16 (-80.25%)
Mutual labels:  corpus-linguistics
compress-fasttext
Tools for shrinking fastText models (in gensim format)
Stars: ✭ 124 (+53.09%)
Mutual labels:  fasttext
fasttext-serving
fastText model serving service
Stars: ✭ 54 (-33.33%)
Mutual labels:  fasttext
fasttext-serverless
Serverless hashtag recommendations using fastText and Python with AWS Lambda
Stars: ✭ 20 (-75.31%)
Mutual labels:  fasttext

goclassy

This is the old OSCAR pipeline. If you are looking for the upcoming pipeline, please take a look at the Ungoliant project.

An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.
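
As a rough illustration of what an asynchronous concurrent classification pipeline can look like in Go, here is a minimal sketch. It is not goclassy's actual code: the names `record`, `detectLanguage` and `classify` are hypothetical, the input "files" are in-memory strings, and `detectLanguage` is a toy keyword matcher standing in for a fastText language identification model.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// record pairs a line of text with the language label assigned to it.
type record struct {
	lang string
	line string
}

// detectLanguage is a hypothetical stand-in for a fastText language-ID model.
// It only looks for a few marker words so the example stays self-contained.
func detectLanguage(line string) string {
	lower := " " + strings.ToLower(line) + " "
	switch {
	case strings.Contains(lower, " the "):
		return "en"
	case strings.Contains(lower, " le ") || strings.Contains(lower, " la "):
		return "fr"
	case strings.Contains(lower, " el ") || strings.Contains(lower, " los "):
		return "es"
	default:
		return "unknown"
	}
}

// classify fans the input files out to a pool of worker goroutines and
// gathers the labelled lines into one bucket per language.
func classify(files []string, workers int) map[string][]string {
	jobs := make(chan string)
	results := make(chan record)

	// Workers: each one reads file contents, labels every non-empty line,
	// and sends the result downstream.
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for content := range jobs {
				for _, line := range strings.Split(content, "\n") {
					if line = strings.TrimSpace(line); line != "" {
						results <- record{lang: detectLanguage(line), line: line}
					}
				}
			}
		}()
	}

	// Producer: feeds the files to the workers without blocking the collector.
	go func() {
		for _, f := range files {
			jobs <- f
		}
		close(jobs)
	}()

	// Close the results channel once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	// Collector: a single goroutine owns the map, so no mutex is needed.
	byLang := make(map[string][]string)
	for r := range results {
		byLang[r.lang] = append(byLang[r.lang], r.line)
	}
	return byLang
}

func main() {
	files := []string{
		"The cat sat on the mat\nLe chat dort sur le lit",
		"Los gatos duermen\nthe end of the story",
	}
	byLang := classify(files, 2)
	for _, lang := range []string{"en", "fr", "es"} {
		fmt.Printf("%s: %d line(s)\n", lang, len(byLang[lang]))
	}
}
```

In this sketch the order of lines within each language bucket depends on goroutine scheduling. A real pipeline would more likely stream each bucket to its own output file than hold everything in memory.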

For more information, see our paper, cited under References below.

If you want to download OSCAR you can do it here.

Note: For the moment, the downloader part of the pipeline is not available because it is still experimental. It will be open-sourced in a future release.

References

@inproceedings{OrtizSuarezSagotRomary2019,
  author    = {Pedro Javier {Ortiz Su{\'a}rez} and Beno{\^i}t Sagot and Laurent Romary},
  title     = {Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures},
  series = {Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019},
  editor    = {Piotr Bański and Adrien Barbaresi and Hanno Biber and Evelyn Breiteneder and Simon Clematide and Marc Kupietz and Harald L{\"u}ngen and Caroline Iliadi},
  publisher = {Leibniz-Institut f{\"u}r Deutsche Sprache},
  address   = {Mannheim},
  doi       = {10.14618/ids-pub-9021},
  url       = {http://nbn-resolving.de/urn:nbn:de:bsz:mh39-90215},
  pages     = {9 -- 16},
  year      = {2019},
  abstract  = {Common Crawl is a considerably large, heterogeneous multilingual corpus comprised of crawled documents from the internet, surpassing 20TB of data and distributed as a set of more than 50 thousand plain text files where each contains many documents written in a wide variety of languages. Even though each document has a metadata block associated to it, this data lacks any information about the language in which each document is written, making it extremely difficult to use Common Crawl for monolingual applications. We propose a general, highly parallel, multithreaded pipeline to clean and classify Common Crawl by language; we specifically design it so that it runs efficiently on medium to low resource infrastructures where I/O speeds are the main constraint. We develop the pipeline so that it can be easily reapplied to any kind of heterogeneous corpus and so that it can be parameterised to a wide range of infrastructures. We also distribute a 6.3TB version of Common Crawl, filtered, classified by language, shuffled at line level in order to avoid copyright issues, and ready to be used for NLP applications.},
  language  = {en}
}