
oligoglot / theedhum-nandrum

License: Apache-2.0
A sentiment classifier on mixed language (and mixed script) reviews in Tamil, Malayalam and English

Programming Languages

python

Projects that are alternatives of or similar to theedhum-nandrum

YouTube to m3u
Grabs .m3u8 streams from YouTube live channels and builds .m3u IPTV playlists for various languages and events: Tamil / Malayalam / English / Hindi / French / Kids / Sports / Urdu etc.
Stars: ✭ 48 (+200%)
Mutual labels:  malayalam, tamil
govarnam
Easily Type Indian Languages on computer and mobile. GoVarnam is a cross-platform transliteration library. Manglish -> Malayalam, Thanglish -> Tamil, Hinglish -> Hindi plus another 10 languages. GoVarnam is a near-Go port of libvarnam
Stars: ✭ 97 (+506.25%)
Mutual labels:  malayalam, tamil
SGDLibrary
MATLAB/Octave library for stochastic optimization algorithms: Version 1.0.20
Stars: ✭ 165 (+931.25%)
Mutual labels:  sgd, logistic-regression
Tensorflow Ml Nlp
Natural language processing starting with TensorFlow and machine learning (from logistic regression to a Transformer chatbot)
Stars: ✭ 176 (+1000%)
Mutual labels:  logistic-regression
Textclassification
several methods for text classification
Stars: ✭ 180 (+1025%)
Mutual labels:  logistic-regression
batchnorm-pruning
Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers https://arxiv.org/abs/1802.00124
Stars: ✭ 66 (+312.5%)
Mutual labels:  sgd
AutoOpt
Automatic and Simultaneous Adjustment of Learning Rate and Momentum for Stochastic Gradient Descent
Stars: ✭ 44 (+175%)
Mutual labels:  sgd
Deep Math Machine Learning.ai
A blog covering machine learning and deep learning algorithms and the math behind them, with machine learning algorithms written from scratch.
Stars: ✭ 173 (+981.25%)
Mutual labels:  logistic-regression
Awd Lstm Lm
LSTM and QRNN Language Model Toolkit for PyTorch
Stars: ✭ 1,834 (+11362.5%)
Mutual labels:  sgd
numpy-neuralnet-exercise
Implementation of key neural network concepts in NumPy
Stars: ✭ 49 (+206.25%)
Mutual labels:  sgd
FactorizationMachine
Implementation of a factorization machine; supports classification.
Stars: ✭ 19 (+18.75%)
Mutual labels:  sgd
Fake news detection
Fake News Detection in Python
Stars: ✭ 194 (+1112.5%)
Mutual labels:  logistic-regression
TransE
Python implementation of the TransE method, explaining TransE's vector updates under SGD
Stars: ✭ 31 (+93.75%)
Mutual labels:  sgd
Deeplearning.ai
This repository contains personal notes and implementation code for the courses offered by deeplearning.ai.
Stars: ✭ 181 (+1031.25%)
Mutual labels:  logistic-regression
Python-AndrewNgML
Python implementation of Andrew Ng's ML course projects
Stars: ✭ 24 (+50%)
Mutual labels:  logistic-regression
Machine Learning Is All You Need
🔥🌟《Machine Learning 格物志》: ML + DL + RL basic codes and notes by sklearn, PyTorch, TensorFlow, Keras & the most important, from scratch!💪 This repository is ALL You Need!
Stars: ✭ 173 (+981.25%)
Mutual labels:  logistic-regression
DiFacto2 ffm
Distributed Fieldaware Factorization Machines based on Parameter Server
Stars: ✭ 11 (-31.25%)
Mutual labels:  sgd
LinkOS-Android-Samples
Java based sample code for developing on Android. The demos in this repository are stored on separate branches. To navigate to a demo, please click branches.
Stars: ✭ 52 (+225%)
Mutual labels:  sgd
Voice Gender
Gender recognition by voice and speech analysis
Stars: ✭ 248 (+1450%)
Mutual labels:  logistic-regression
AIML-Projects
Projects I completed as a part of Great Learning's PGP - Artificial Intelligence and Machine Learning
Stars: ✭ 85 (+431.25%)
Mutual labels:  logistic-regression

theedhum-nandrum (தீதும் நன்றும்)

A sentiment classifier on mixed language (and mixed script) reviews in Tamil, Malayalam and English. You can read our paper describing the approach at https://arxiv.org/abs/2010.03189. Please cite our paper if you are using this.

@misc{lakshmanan2020theedhum,
  title={Theedhum Nandrum@Dravidian-CodeMix-FIRE2020: A Sentiment Polarity Classifier for YouTube Comments with Code-switching between Tamil, Malayalam and English},
  author={BalaSundaraRaman Lakshmanan and Sanjeeth Kumar Ravindranath},
  year={2020},
  eprint={2010.03189},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Image of TheedhumNandrum

Installation

Pre-requisites

  • Python 3.7 or above

Getting the code


  • cd /path/to/parent/
  • git clone https://github.com/oligoglot/theedhum-nandrum.git
  • cd theedhum-nandrum

Setting up dev environment


  • virtualenv venv_tn
  • source venv_tn/bin/activate
  • pip install -r requirements.txt

Running the classification scripts


  • Activate the virtualenv:
    • source venv_tn/bin/activate
  • cd src/tn
  • Hyperparameter tuning for the SGD classifier
    • python3 sentiment_classifier.py experiment ta ../../resources/data/tamil_train.tsv ../../resources/data/tamil_dev.tsv configs/tuning_experiments_1.json
  • Classification for the Tamil input set
    • python3 sentiment_classifier.py test ta ../../resources/data/tamil_train.tsv ../../resources/data/tamil_dev.tsv <output file>
  • Classification for the Malayalam input set
    • python3 sentiment_classifier.py test ml ../../resources/data/malayalam_train.tsv ../../resources/data/malayalam_dev.tsv <output file>
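
The commands above share one positional layout: mode, language code, training file, dev file, and a final argument that is a config file for `experiment` and an output file for `test`. If you want to script several runs, a helper along the following lines may be convenient. This is only a sketch: the helper name `build_command` and the output file name are hypothetical, and the layout is inferred solely from the commands listed above, not from the script's source.

```python
import shlex

SCRIPT = "sentiment_classifier.py"  # script name as it appears in the commands above


def build_command(mode, lang, train_path, dev_path, last_arg):
    """Assemble the argument list for one run (hypothetical helper).

    last_arg is the config JSON for mode "experiment" and the output
    file for mode "test", mirroring the commands listed above.
    """
    if mode not in ("experiment", "test"):
        raise ValueError(f"unexpected mode: {mode!r}")
    if lang not in ("ta", "ml"):
        raise ValueError(f"unexpected language code: {lang!r}")
    return ["python3", SCRIPT, mode, lang, train_path, dev_path, last_arg]


if __name__ == "__main__":
    data = "../../resources/data"
    cmd = build_command(
        "test", "ml",
        f"{data}/malayalam_train.tsv",
        f"{data}/malayalam_dev.tsv",
        "malayalam_predictions.tsv",  # hypothetical output file name
    )
    # Print a copy-pasteable shell command; pass cmd to subprocess.run to execute it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Run it from src/tn so that the relative data paths resolve.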

Steps

Pre-processing

Noise removal

  1. Remove irrelevant parts of the data, such as HTML tags
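
One way to approach this step is sketched below, using only the Python standard library. It is an illustration, not necessarily what the project's code does: the names `strip_html` and `_TextExtractor` are made up for this example. It keeps the text nodes of a comment, drops the tags, decodes character references such as `&amp;`, and collapses leftover whitespace.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (illustrative helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_html(comment):
    """Return the comment with HTML tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(comment)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.text().split())


if __name__ == "__main__":
    print(strip_html("Semma <b>mass</b> trailer &amp; BGM <br/>"))
```

Tamil and Malayalam script passes through unchanged, since only markup is dropped.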

Language identification

  1. If the text is in a different language, the classifier needs to output "Not tamil"
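
One cheap signal for this step is the script a comment is written in. The sketch below counts characters by Unicode block (Tamil U+0B80 to U+0BFF, Malayalam U+0D00 to U+0D7F) and reports the dominant script. It is an assumption-laden illustration rather than the project's method: the function names are hypothetical, and script alone cannot decide the label, because code-mixed comments are often Tamil or Malayalam written in Latin script.

```python
TAMIL = range(0x0B80, 0x0C00)      # Unicode Tamil block, U+0B80..U+0BFF
MALAYALAM = range(0x0D00, 0x0D80)  # Unicode Malayalam block, U+0D00..U+0D7F


def script_counts(text):
    """Count characters per script; digits, spaces, punctuation and emoji are ignored."""
    counts = {"tamil": 0, "malayalam": 0, "latin": 0, "other": 0}
    for ch in text:
        cp = ord(ch)
        if cp in TAMIL:
            counts["tamil"] += 1  # includes vowel signs, which are not "letters"
        elif cp in MALAYALAM:
            counts["malayalam"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["latin"] += 1
        elif ch.isalpha():
            counts["other"] += 1
    return counts


def dominant_script(text):
    """Return the script with the most characters, or "none" if no letters were found."""
    counts = script_counts(text)
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "none"


if __name__ == "__main__":
    for sample in ["இந்த படம் super", "നല്ല പടം", "semma mass", "123 !!"]:
        print(repr(sample), dominant_script(sample))
```

A "latin" result here is inconclusive on its own: the comment may be English or romanized Tamil, so a real language identifier would need lexical cues as well.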

Attributions

  1. Spelling Corrector in Python 3; see http://norvig.com/spell-correct.html. Copyright (c) 2007-2016 Peter Norvig. MIT license: www.opensource.org/licenses/mit-license.php
  2. Module to convert Unicode emojis to corresponding sentiment rankings. Based on the research by Kralj Novak P, Smailović J, Sluban B, Mozetič I (2015) on Sentiment of Emojis. Journal link: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0144296. CSV data acquired from the CLARIN repository; repository link: http://hdl.handle.net/11356/1048
  3. Datasets:

@inproceedings{chakravarthi-etal-2020-corpus,
  title = "Corpus Creation for Sentiment Analysis in Code-Mixed {T}amil-{E}nglish Text",
  author = "Chakravarthi, Bharathi Raja and Muralidaran, Vigneshwaran and Priyadharshini, Ruba and McCrae, John Philip",
  booktitle = "Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)",
  month = may,
  year = "2020",
  address = "Marseille, France",
  publisher = "European Language Resources association",
  url = "https://www.aclweb.org/anthology/2020.sltu-1.28",
  pages = "202--210",
  abstract = "Understanding the sentiment of a comment from a video or an image is an essential task in many applications. Sentiment analysis of a text can be useful for various decision-making processes. One such application is to analyse the popular sentiments of videos on social media based on viewer comments. However, comments from social media do not follow strict rules of grammar, and they contain mixing of more than one language, often written in non-native scripts. Non-availability of annotated code-mixed data for a low-resourced language like Tamil also adds difficulty to this problem. To overcome this, we created a gold standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube. In this paper, we describe the process of creating the corpus and assigning polarities. We present inter-annotator agreement and show the results of sentiment analysis trained on this corpus as a benchmark.",
  language = "English",
  ISBN = "979-10-95546-35-1",
}

@inproceedings{Chakravarthi2020ASA,
  title={A Sentiment Analysis Dataset for Code-Mixed Malayalam-English},
  author={Bharathi Raja Chakravarthi and Navya Jose and Shardul Suryawanshi and E. Sherly and John P. McCrae},
  booktitle={SLTU/CCURL@LREC},
  year={2020}
}