ARBML / masader

Licence: other
The largest public catalogue for Arabic NLP and speech datasets. It contains more than 250 datasets, each annotated with more than 25 attributes.

Masader

The first online catalogue for Arabic NLP datasets. This catalogue contains 200 datasets with more than 25 metadata annotations for each dataset. You can browse the full list of datasets on the website: https://arbml.github.io/masader/

Title: Masader: Metadata Sourcing for Arabic Text and Speech Data Resources
Authors: Zaid Alyafeai, Maraim Masoud, Mustafa Ghaleb, Maged S. Al-shaibani
Paper: https://arxiv.org/abs/2110.06744

Abstract: The NLP pipeline has evolved dramatically in the last few years. The first step in the pipeline is to find suitable annotated datasets to evaluate the tasks we are trying to solve. Unfortunately, most of the published datasets lack metadata annotations that describe their attributes. Not to mention, the absence of a public catalogue that indexes all the publicly available datasets related to specific regions or languages. When we consider low-resource dialectical languages, for example, this issue becomes more prominent. In this paper we create Masader, the largest public catalogue for Arabic NLP datasets, which consists of 200 datasets annotated with 25 attributes. Furthermore, we develop a metadata annotation strategy that could be extended to other languages. We also make remarks and highlight some issues about the current status of Arabic NLP datasets and suggest recommendations to address them.

Metadata

  • No.: dataset number
  • Name: name of the dataset
  • Subsets: subsets of the dataset
  • Link: direct link to the dataset or instructions on how to download it
  • License: license of the dataset
  • Year: year the dataset or paper was published
  • Language: ar or multilingual
  • Dialect: a region, e.g. ar-LEV: (Arabic(Levant)); a country, e.g. ar-EGY: (Arabic (Egypt)); or a type, e.g. ar-MSA: (Arabic (Modern Standard Arabic))
  • Domain: social media, news articles, reviews, commentary, books, transcribed audio or other
  • Form: text, audio or sign language
  • Collection style: crawling, crawling and annotation (translation), crawling and annotation (other), machine translation, human translation, human curation or other
  • Description: short statement describing the dataset
  • Volume: the size of the dataset as a number
  • Unit: unit of the volume; one of tokens, sentences, documents, MB, GB, TB, hours or other
  • Provider: company or university providing the dataset
  • Related Datasets: any datasets that are related to this dataset in terms of content
  • Paper Title: title of the paper
  • Paper Link: direct link to the paper PDF
  • Script: writing system; one of Arab, Latn, Arab-Latn or other
  • Tokenized: whether the dataset is segmented using morphology: Yes or No
  • Host: the website hosting the data, e.g. GitHub
  • Access: the data is either free, upon-request or with-fee
  • Cost: cost of the data if access is with-fee
  • Test split: whether the data contains a test split: Yes or No
  • Tasks: the tasks included in the dataset, separated by commas
  • Evaluation Set: whether the data is included in the BigScience evaluation suite
  • Venue Title: short title of the venue, e.g. ACL
  • Citations: the number of citations
  • Venue Type: conference, workshop, journal or preprint
  • Venue Name: full name of the venue, e.g. Association for Computational Linguistics
  • Authors: list of the paper authors, separated by commas
  • Affiliations: list of the paper authors' affiliations, separated by commas
  • Abstract: abstract of the paper
  • Added by: name of the person who added the entry
  • Notes: any extra notes on the dataset
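
The Dialect values listed above follow a "code: (label)" pattern. As a rough illustration, the sketch below splits such a value into its two parts. The `Dialect` tuple, the regular expression, and the `parse_dialect` helper are hypothetical names introduced here, not part of Masader, and the pattern is assumed from the three examples in the list rather than from a formal specification.

```python
import re
from typing import NamedTuple, Optional


class Dialect(NamedTuple):
    """Hypothetical container for a parsed Dialect value."""
    code: str   # e.g. "ar-LEV"
    label: str  # e.g. "Arabic(Levant)"


# Assumed shape, based on the examples in the list: "ar-LEV: (Arabic(Levant))".
_DIALECT_RE = re.compile(
    r"^\s*(?P<code>[A-Za-z]+(?:-[A-Za-z]+)*)\s*:\s*\((?P<label>.*)\)\s*$"
)


def parse_dialect(value: str) -> Optional[Dialect]:
    """Return the code and label, or None if the value has another shape."""
    match = _DIALECT_RE.match(value)
    if match is None:
        return None
    return Dialect(match.group("code"), match.group("label").strip())


print(parse_dialect("ar-LEV: (Arabic(Levant))"))
# Dialect(code='ar-LEV', label='Arabic(Levant)')
print(parse_dialect("ar-EGY: (Arabic (Egypt))"))
# Dialect(code='ar-EGY', label='Arabic (Egypt)')
```

Returning None for values that do not match keeps the helper safe to run over a whole column, where some entries may be missing or formatted differently.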

Access Data

You can access the annotated dataset using the Hugging Face datasets library

from datasets import load_dataset

# Download the catalogue from the Hugging Face Hub and inspect the first entry.
masader = load_dataset('arbml/masader')
masader['train'][0]

which gives the following output

{'Abstract': 'Modern Standard Arabic (MSA) is the official language used in education and media across the Arab world both in writing and formal speech. However, in daily communication several dialects depending on the country, region as well as other social factors, are used. With the emergence of social media, the dialectal amount of data on the Internet have increased and the NLP tools that support MSA are not well-suited to process this data due to the difference between the dialects and MSA. In this paper, we construct the Shami corpus, the first Levantine Dialect Corpus (SDC) covering data from the four dialects spoken in Palestine, Jordan, Lebanon and Syria. We also describe rules for pre-processing without affecting the meaning so that it is processable by NLP tools. We choose Dialect Identification as the task to evaluate SDC and compare it with two other corpora. In this respect, experiments are conducted using different parameters based on n-gram models and Naive Bayes classifiers. SDC is larger than the existing corpora in terms of size, words and vocabularies. In addition, we use the performance on the Language Identification task to exemplify the similarities and differences in the individual dialects.',
 'Access': 'Free',
 'Added By': 'nan',
 'Affiliations': ',The Islamic University of Gaza,,',
 'Authors': 'Chatrine Qwaider,Motaz Saad,S. Chatzikyriakidis,Simon Dobnik',
 'Citations': '25.0',
 'Collection Style': 'crawling and annotation(other)',
 'Cost': 'nan',
 'Derived From': 'nan',
 'Description': 'the first Levantine Dialect Corpus (SDC) covering data from the four dialects spoken in Palestine, Jordan, Lebanon and Syria.',
 'Dialect': 'ar-LEV: (Arabic(Levant))',
 'Domain': 'social media',
 'Ethical Risks': 'Medium',
 'Form': 'text',
 'Host': 'GitHub',
 'Language': 'ar',
 'License': 'Apache-2.0',
 'Link': 'https://github.com/GU-CLASP/shami-corpus',
 'Name': 'Shami',
 'Paper Link': 'https://aclanthology.org/L18-1576.pdf',
 'Paper Title': 'Shami: A Corpus of Levantine Arabic Dialects',
 'Provider': 'Multiple institutions ',
 'Script': 'Arab',
 'Subsets': [{'Dialect': 'ar-JO: (Arabic (Jordan))',
   'Name': 'Jordanian',
   'Unit': 'sentences',
   'Volume': '32,078'},
  {'Dialect': 'ar-PS: (Arabic (Palestinian Territories))',
   'Name': 'Palestanian',
   'Unit': 'sentences',
   'Volume': '21,264'},
  {'Dialect': 'ar-SY: (Arabic (Syria))',
   'Name': 'Syrian',
   'Unit': 'sentences',
   'Volume': '48,159'},
  {'Dialect': 'ar-LB: (Arabic (Lebanon))',
   'Name': 'Lebanese',
   'Unit': 'sentences',
   'Volume': '16,304'}],
 'Tasks': 'dialect identification',
 'Test Split': 'No',
 'Tokenized': 'No',
 'Unit': 'sentences',
 'Venue Name': 'International Conference on Language Resources and Evaluation',
 'Venue Title': 'LREC',
 'Venue Type': 'conference',
 'Volume': '117,805',
 'Year': 2018}
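
Numeric fields such as Volume come back as strings with thousands separators, so they need converting before any arithmetic. The sketch below is one possible way to do that, using a trimmed copy of the entry above so that it runs without downloading anything. The helpers `parse_volume` and `subsets_total` are hypothetical, and they assume whole-number strings like '32,078'; other entries in the catalogue may use different formats.

```python
def parse_volume(text: str) -> int:
    """Convert a volume string such as '32,078' to an integer."""
    return int(text.replace(",", ""))


def subsets_total(entry: dict) -> int:
    """Sum the volumes of all subsets listed in one catalogue entry."""
    return sum(parse_volume(subset["Volume"]) for subset in entry.get("Subsets", []))


# Trimmed copy of the Shami entry shown in the output above.
entry = {
    "Name": "Shami",
    "Unit": "sentences",
    "Volume": "117,805",
    "Subsets": [
        {"Name": "Jordanian", "Unit": "sentences", "Volume": "32,078"},
        {"Name": "Palestanian", "Unit": "sentences", "Volume": "21,264"},
        {"Name": "Syrian", "Unit": "sentences", "Volume": "48,159"},
        {"Name": "Lebanese", "Unit": "sentences", "Volume": "16,304"},
    ],
}

total = subsets_total(entry)
print(total)                                   # 117805
print(total == parse_volume(entry["Volume"]))  # True
```

For this entry the four subset volumes add up to the top-level Volume, which makes the comparison a simple consistency check on the record.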

Running MASADER locally with Jekyll

Prerequisites:

  1. Install Ruby.
  2. Install Bundler.
  3. Install Jekyll.

Steps:

  1. Open the project in a terminal.
  2. Run bundle install to install the project's dependencies.
  3. Run the site locally with bundle exec jekyll serve.
  4. To preview the MASADER site, open http://127.0.0.1:4000/masader/ in your web browser.

Note: Navigate to the publishing source for MASADER site. For more information about publishing sources, see.

Contribution

The catalogue will be updated regularly. If you want to add a new dataset, use this form.

Collaborative Work

Masader was developed as part of the BigScience project for open research 🌸, a year-long initiative targeting the study of large language models and datasets. The goal of the project is to research language models in a public environment outside large technology companies. The project has more than 700 researchers from 50 countries and more than 250 institutions. We conducted most of this research within the data sourcing working group, which is responsible for collecting sources for multiple languages.

Citation

@misc{alyafeai2021masader,
      title={Masader: Metadata Sourcing for Arabic Text and Speech Data Resources}, 
      author={Zaid Alyafeai and Maraim Masoud and Mustafa Ghaleb and Maged S. Al-shaibani},
      year={2021},
      eprint={2110.06744},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}