JuliaText / CorpusLoaders.jl

Licence: other
A variety of loaders for various NLP corpora.

CorpusLoaders

A collection of loaders for a variety of corpora used in NLP.

Installation

Install it like any other Julia package:

julia> using Pkg; Pkg.add("CorpusLoaders")

Alternatively, from the Pkg REPL (entered by pressing ] at the julia> prompt), add the package with the add command:

pkg> add CorpusLoaders

Common Structure

Suppose a corpus has type Corpus. It has a constructor Corpus(path), where the path argument is the path to the files that make up the corpus. If path is not provided, it defaults to a predefined data dependency, which is downloaded the first time you call Corpus(). When the data dependency resolves, it gives full bibliographic details on the corpus, among other information. For more on this, including configuration details, see DataDeps.jl.
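The constructor convention described above can be sketched in plain Julia. This is only an illustration under stated assumptions: ToyCorpus and default_toy_path are hypothetical names that do not exist in CorpusLoaders.jl, and the temporary-directory path merely stands in for the location that a data dependency would resolve to.

```julia
# Hypothetical corpus type for illustration; not part of CorpusLoaders.jl.
struct ToyCorpus
    path::String
end

# Stand-in for the data-dependency lookup (assumption). In the real package
# this role is played by DataDeps.jl, which fetches the data on first use.
default_toy_path() = joinpath(tempdir(), "toy_corpus")

# The zero-argument constructor falls back to the default location.
ToyCorpus() = ToyCorpus(default_toy_path())

explicit = ToyCorpus("/data/my_copy")  # caller supplies the path
implicit = ToyCorpus()                 # path comes from the default
```

With this shape, users who already have the files on disk pass their own path, and everyone else gets the default without any extra configuration.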

Each corpus has a function load(::Corpus), which returns an iterator over the data. The iterator is often lazy (for example, backed by a Channel), because many corpora are too large to fit comfortably in memory. It is often an iterator of iterators of iterators ..., designed to be manipulated with MultiResolutionIterators.jl. The corpus type acts as an indexer for the named levels used by MultiResolutionIterators.jl, so lvls(Corpus, :para) works.
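To make the "lazy iterator of iterators" idea concrete, here is a minimal sketch that uses only Base Julia. TinyCorpus and this load method are hypothetical and are not the package's implementation; the naive splitting on ". " and on whitespace is an assumption made purely to produce a nested structure.

```julia
# Hypothetical in-memory corpus; real loaders read from files on disk.
struct TinyCorpus
    docs::Vector{String}
end

# Returns a Channel that yields one document at a time, on demand.
# Each document is a vector of sentences; each sentence is a vector of words.
function load(corpus::TinyCorpus)
    Channel{Vector{Vector{String}}}() do ch
        for doc in corpus.docs
            sentences = split(doc, ". "; keepempty=false)
            put!(ch, [String.(split(rstrip(s, '.'))) for s in sentences])
        end
    end
end

corpus = TinyCorpus(["The cat sat. It purred.", "Dogs bark."])

# Documents are produced only as the loop asks for them.
for doc in load(corpus)
    for sentence in doc
        println(join(sentence, " "))
    end
end
```

Here the nesting is documents, then sentences, then words. Base's Iterators.flatten can merge adjacent levels by position; MultiResolutionIterators.jl is intended to let such levels be addressed by name instead.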

Corpora

Follow the links below for full docs on the usage of the corpora.
