
oscar-corpus / ungoliant

License: Apache-2.0
🕷️ The pipeline for the OSCAR corpus

Programming Languages

rust
11053 projects

Projects that are alternatives to or similar to ungoliant

goclassy
An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.
Stars: ✭ 81 (+17.39%)
Mutual labels:  corpus-linguistics, fasttext, common-crawl, language-classification
Text Classification TF
Implements various text classification models in TensorFlow, wrapped in a RESTful interface and ready for production use
Stars: ✭ 32 (-53.62%)
Mutual labels:  fasttext
Sentence Classification
Sentence Classifications with Neural Networks
Stars: ✭ 177 (+156.52%)
Mutual labels:  fasttext
kanji-frequency
Kanji usage frequency data collected from various sources
Stars: ✭ 92 (+33.33%)
Mutual labels:  corpus-linguistics
Cw2vec
cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information
Stars: ✭ 224 (+224.64%)
Mutual labels:  fasttext
kontext
An advanced, extensible web front-end for the Manatee-open corpus search engine
Stars: ✭ 50 (-27.54%)
Mutual labels:  corpus-linguistics
Gensim
Topic Modelling for Humans
Stars: ✭ 12,763 (+18397.1%)
Mutual labels:  fasttext
fasttext-serving
Serve your fastText models for text classification and word vectors
Stars: ✭ 21 (-69.57%)
Mutual labels:  fasttext
KeywordAnalysis
Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends
Stars: ✭ 49 (-28.99%)
Mutual labels:  commoncrawl
nerus
Large silver standard Russian corpus with NER, morphology and syntax markup
Stars: ✭ 47 (-31.88%)
Mutual labels:  corpus-linguistics
CogNet
CogNet: a large-scale, high-quality cognate database for 338 languages, 1.07M words, and 8.1 million cognates
Stars: ✭ 26 (-62.32%)
Mutual labels:  corpus-linguistics
Pyfasttext
Yet another Python binding for fastText
Stars: ✭ 229 (+231.88%)
Mutual labels:  fasttext
fasttextjs
JavaScript implementation of the FastText prediction algorithm
Stars: ✭ 31 (-55.07%)
Mutual labels:  fasttext
Shallowlearn
An experiment about re-implementing supervised learning models based on shallow neural network approaches (e.g. fastText) with some additional exclusive features and nice API. Written in Python and fully compatible with Scikit-learn.
Stars: ✭ 196 (+184.06%)
Mutual labels:  fasttext
fastchess
Predicts the best chess move with 27.5% accuracy by a single matrix multiplication
Stars: ✭ 75 (+8.7%)
Mutual labels:  fasttext
Wordvectors
Pre-trained word vectors of 30+ languages
Stars: ✭ 2,043 (+2860.87%)
Mutual labels:  fasttext
Ai law
All kinds of baseline models for long text classification (text categorization)
Stars: ✭ 243 (+252.17%)
Mutual labels:  fasttext
corpusexplorer2.0
Corpus linguistics has never been so easy...
Stars: ✭ 16 (-76.81%)
Mutual labels:  corpus-linguistics
actions-suggest-related-links
A GitHub Action to suggest related or similar issues, documents, and links. Based on the power of NLP and fastText.
Stars: ✭ 23 (-66.67%)
Mutual labels:  fasttext
Simple-Sentence-Similarity
Exploring the simple sentence similarity measurements using word embeddings
Stars: ✭ 99 (+43.48%)
Mutual labels:  fasttext

Ungoliant

🕷️ Ungoliant is a high-performance pipeline that provides tools to build corpus generation pipelines from CommonCrawl. 🕷️

It is currently the generation pipeline for the OSCAR corpus, which is built from CommonCrawl. Ungoliant replaces goclassy.

Installation

Installing/Compiling the binary

  • Via cargo: cargo install ungoliant
  • Via git: cargo install --git https://github.com/oscar-corpus/ungoliant

Ungoliant has numerous dependencies, which are compiled during installation. cmake and gcc may also be required, because the project uses fasttext-rs.
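
Before running cargo install, it can help to check that the build tools are on your PATH. The following is a minimal sketch, not part of Ungoliant itself: the missing_tools helper is a hypothetical name introduced here, and the tool list (cargo, cmake, gcc) is taken from the note above.

```shell
#!/bin/sh
# Hypothetical helper (not part of ungoliant): print the name of every
# listed tool that cannot be found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# cargo builds the crate; cmake and gcc may be needed by fasttext-rs.
missing=$(missing_tools cargo cmake gcc)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing build tools:" $missing
else
    echo "All build tools found."
fi
```

If any tool is reported missing, install it with your system's package manager before retrying the installation.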

Getting the language identification file (for fastText):

Use curl https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -o lid.176.bin.

Usage

The usual way of generating corpora is:

  1. Fetch the wet.paths.gz file from the latest CommonCrawl dump and decompress it.
  2. Download the files using the download command.
  3. Generate the corpus using the pipeline command (it may take some time).
  4. Deduplicate if needed using the dedup command.
  5. Split into smaller files using the split command.
  6. Compress using compress :-)
  7. Package using the package command, which creates language-specific folders, moves the relevant files into them, and adds a checksum file.

You can find more information in each command's --help output.

ungoliant 0.1.0
corpus generation tool.

USAGE:
    ungoliant <SUBCOMMAND>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    compress    Compress
    dedup       Deduplicate a generated, not split corpus.
    download    Downloading of CommonCrawl
    help        Prints this message or the help of the given subcommand(s)
    package     package
    pipeline    Run pipeline
    split       Split a not split corpus
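
Step 1 of the workflow above (fetching and decompressing wet.paths.gz) can be sketched as follows. This is an illustration under stated assumptions: the helper names wet_paths_url and fetch_wet_paths are hypothetical, the crawl identifier CC-MAIN-2021-10 is only an example, and the URL layout is the one Common Crawl uses on data.commoncrawl.org at the time of writing. Check the announcement page of the dump you want for the exact link.

```shell
#!/bin/sh
# Hypothetical helper: build the URL of the wet.paths.gz index for a crawl.
# Assumption: the data.commoncrawl.org layout /crawl-data/<crawl-id>/wet.paths.gz.
wet_paths_url() {
    printf 'https://data.commoncrawl.org/crawl-data/%s/wet.paths.gz\n' "$1"
}

# Hypothetical helper: download the index and decompress it in place,
# leaving a plain-text file named wet.paths in the current directory.
fetch_wet_paths() {
    curl -fSL "$(wet_paths_url "$1")" -o wet.paths.gz && gunzip -f wet.paths.gz
}

# Show the URL that would be fetched for an example crawl identifier.
wet_paths_url "CC-MAIN-2021-10"
```

Calling fetch_wet_paths CC-MAIN-2021-10 performs the actual download. The arguments for the later steps are not shown here; ungoliant download --help and the other subcommands' --help output list them.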

Documentation

Ungoliant is not yet on docs.rs: use cargo doc --bins --open to open the documentation.

Benchmarking

The benchmarks are not yet up to date. Use cargo bench to run them; the results are written to target/criterion/report/index.html.
