josecannete / spanish-corpora

Licence: MIT License
Unannotated Spanish 3 Billion Words Corpora

Spanish Unannotated Corpora

This repository gathers a compilation of Spanish-language corpora. The data is available for download from Zenodo.

Data

Number of lines: 300904000 (300M)

Number of tokens: 2996016962 (3B)

Number of chars: 18431160978 (18.4B)
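The README does not say how these figures were computed. As a rough sketch, similar counts can be obtained by streaming the corpus line by line. The snippet below assumes that tokens are whitespace-separated and that trailing newlines are excluded from the character count; both are assumptions, and `corpus_stats` is a hypothetical helper, not part of this repository.

```python
import io


def corpus_stats(lines):
    """Count lines, whitespace-separated tokens, and characters.

    Assumptions (not confirmed by the repository): tokens are split on
    whitespace, and the trailing newline is not counted as a character.
    """
    n_lines = n_tokens = n_chars = 0
    for line in lines:
        text = line.rstrip("\n")
        n_lines += 1
        n_tokens += len(text.split())
        n_chars += len(text)
    return n_lines, n_tokens, n_chars


# Small in-memory sample; for the real corpus, pass an open file handle
# so the text is streamed rather than loaded into memory.
sample = io.StringIO("hola mundo\nesto es una prueba\n")
print(corpus_stats(sample))
```

Because the function only iterates over its input once, it works unchanged on a file opened with `open(path, encoding="utf-8")`, which matters for a corpus of this size.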

Sources

Spanish Wikis: These include Wikipedia, Wikinews, Wikiquote and more. They were first processed with wikiextractor (https://github.com/josecannete/wikiextractorforBERT) using the wiki dumps of 20/04/2019.

ParaCrawl: Spanish portion of ParaCrawl (http://opus.nlpl.eu/ParaCrawl.php)

EUBookshop: Spanish portion of EUBookshop (http://opus.nlpl.eu/EUbookshop.php)

MultiUN: Spanish portion of MultiUN (http://opus.nlpl.eu/MultiUN.php)

OpenSubtitles: Spanish portion of OpenSubtitles2018 (http://opus.nlpl.eu/OpenSubtitles-v2018.php)

DGT: Spanish portion of DGT (http://opus.nlpl.eu/DGT.php)

DOGC: Spanish portion of DOGC (http://opus.nlpl.eu/DOGC.php)

ECB: Spanish portion of ECB (http://opus.nlpl.eu/ECB.php)

EMEA: Spanish portion of EMEA (http://opus.nlpl.eu/EMEA.php)

Europarl: Spanish portion of Europarl (http://opus.nlpl.eu/Europarl.php)

GlobalVoices: Spanish portion of GlobalVoices (http://opus.nlpl.eu/GlobalVoices.php)

JRC: Spanish portion of JRC (http://opus.nlpl.eu/JRC-Acquis.php)

News-Commentary11: Spanish portion of NCv11 (http://opus.nlpl.eu/News-Commentary-v11.php)

TED: Spanish portion of TED (http://opus.nlpl.eu/TED2013.php)

UN: Spanish portion of UN (http://opus.nlpl.eu/UN.php)

Post-processing

Two post-processing scripts are included (corpus_processing.py and split_punctuation.py). The available data was processed with the first one only.

Using process_corpus.py:

  • Lowercased the text
  • Removed URLs
  • Removed listings
  • Replaced multiple spaces with a single one
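
The four steps above can be approximated with a few regular expressions. This is a minimal sketch, not the repository's actual script: the patterns, the reading of "listing" as leading bullet or number markers, the final `strip()`, and the name `clean_line` are all assumptions made for illustration.

```python
import re

# Hypothetical patterns; the real script may use different rules.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# Assumption: "listing" means a leading bullet or number marker.
LISTING_RE = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")
SPACES_RE = re.compile(r"\s+")


def clean_line(line):
    """Apply the four steps in the order the README lists them."""
    line = line.lower()                     # lowercase
    line = URL_RE.sub(" ", line)            # remove URLs
    line = LISTING_RE.sub("", line)         # remove list markers
    line = SPACES_RE.sub(" ", line).strip() # collapse repeated spaces
    return line


print(clean_line("  1. Visita   https://example.com  HOY "))
print(clean_line("• Hola   Mundo"))
```

Collapsing whitespace last means the gaps left behind by removed URLs are cleaned up in the same pass.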