wongnai / Wongnai Corpus

Licence: lgpl-3.0
Collection of Wongnai's datasets

Wongnai-corpus

This project is a collection of Wongnai's datasets, most of which are in Thai. We hope these datasets will advance research in natural language processing (NLP), especially for the Thai language.

1. Search query dataset

There are 500,000 unique words extracted from search queries. These words were labeled by algorithms and by judges for a word segmentation task. Our segmentation criterion is to segment out the longest possible food word, in order to achieve the highest precision in the search system.
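
To make the "longest possible food word" criterion concrete, here is a minimal sketch of greedy longest matching against a dictionary. This is a generic technique used for illustration only, not the algorithm Wongnai used for labeling. The function name `segment_longest_match` and the tiny sample dictionary are hypothetical.

```python
def segment_longest_match(text, dictionary):
    """Greedy left-to-right segmentation that prefers the longest dictionary word.

    Characters not covered by any dictionary word become single-character tokens.
    """
    if not dictionary:
        return list(text)
    max_len = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink one character at a time.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            match = text[i]  # fallback: unknown character on its own
        tokens.append(match)
        i += len(match)
    return tokens


if __name__ == "__main__":
    # Hypothetical sample entries; the real food dictionary is far larger.
    rice, fried_rice, shrimp = "ข้าว", "ข้าวผัด", "กุ้ง"
    sample_dictionary = {rice, fried_rice, shrimp}
    # The longer word "ข้าวผัด" is preferred over its prefix "ข้าว".
    print(segment_longest_match(fried_rice + shrimp, sample_dictionary))
```

Preferring the longest match keeps a compound food name as a single token instead of splitting it into shorter words that also appear in the dictionary.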

1.1 Files

  • search/labeled_queries_by_algo.txt : List of 500K words labeled by algorithms, which are described in detail in a blog post.

  • search/labeled_queries_by_judges.txt : List of 10K words labeled by judges following Wongnai's search criteria.

  • search/food_dictionary.txt : List of 400K food words used for labeling labeled_queries_by_algo.txt.

Please note that these words were collected from user-generated content (UGC), which might include some off-topic words.

1.2 Usage

  • You may use labeled_queries_by_algo.txt to train your own word segmentation model by splitting it into training and validation sets, and then evaluate your model with labeled_queries_by_judges.txt.
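
The split described above could be sketched as follows. This is only an illustration: it assumes each file holds one labeled example per line in UTF-8 (the file format is not specified here), and the helper names `read_lines` and `split_train_validation`, the 10% validation fraction, and the seed are all hypothetical choices.

```python
import random


def read_lines(path):
    """Read a UTF-8 text file into a list of lines (assumed: one example per line)."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def split_train_validation(lines, validation_fraction=0.1, seed=42):
    """Shuffle non-empty lines reproducibly and split them into (train, validation)."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    items = [line for line in lines if line.strip()]
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    rng.shuffle(items)
    n_validation = int(len(items) * validation_fraction)
    return items[n_validation:], items[:n_validation]


if __name__ == "__main__":
    algo_labeled = read_lines("search/labeled_queries_by_algo.txt")
    train, validation = split_train_validation(algo_labeled)
    # Held out for final evaluation only; never used for training or tuning.
    judge_labeled = read_lines("search/labeled_queries_by_judges.txt")
    print(len(train), len(validation), len(judge_labeled))
```

Keeping the judge-labeled file out of both training and validation means the final score reflects agreement with human judgment rather than with the labeling algorithms.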

2. Review dataset

The review dataset contains restaurant reviews and their ratings. There are 5 rating classes, ranging from 1 to 5 stars.

2.1 Files

2.2 Usage

Wongnai data services
