All Projects → cair → TextUnderstandingTsetlinMachine

cair / TextUnderstandingTsetlinMachine

Licence: other
Using the Tsetlin Machine to learn human-interpretable rules for high-accuracy text categorization with medical applications

Programming Languages

Cuda
1817 projects
python
139335 projects - #7 most used programming language
Makefile
30231 projects

Projects that are alternatives of or similar to TextUnderstandingTsetlinMachine

fast-tsetlin-machine-with-mnist-demo
A fast Tsetlin Machine implementation employing bit-wise operators, with MNIST demo.
Stars: ✭ 58 (+20.83%)
Mutual labels:  frequent-pattern-mining, tsetlin-machine
Openemr
The most popular open source electronic health records and medical practice management solution.
Stars: ✭ 1,762 (+3570.83%)
Mutual labels:  medical, medical-informatics
humanapi
The easiest way to integrate health data from anywhere - https://www.humanapi.co
Stars: ✭ 21 (-56.25%)
Mutual labels:  medical
ChRIS store ui
UI for the ChRIS Store
Stars: ✭ 47 (-2.08%)
Mutual labels:  medical
policy-data-analyzer
Building a model to recognize incentives for landscape restoration in environmental policies from Latin America, the US and India. Bringing NLP to the world of policy analysis through an extensible framework that includes scraping, preprocessing, active learning and text analysis pipelines.
Stars: ✭ 22 (-54.17%)
Mutual labels:  text-classification
FPGrowth-and-Apriori-algorithm-Association-Rule-Data-Mining
Implementation of FPTree-Growth and Apriori-Algorithm for finding frequent patterns in Transactional Database.
Stars: ✭ 19 (-60.42%)
Mutual labels:  frequent-pattern-mining
kwx
BERT, LDA, and TFIDF based keyword extraction in Python
Stars: ✭ 33 (-31.25%)
Mutual labels:  text-classification
ChRIS ui
UI for ChRIS
Stars: ✭ 20 (-58.33%)
Mutual labels:  medical
HiLAP
Code for paper "Hierarchical Text Classification with Reinforced Label Assignment" EMNLP 2019
Stars: ✭ 116 (+141.67%)
Mutual labels:  text-classification
infirmary-integrated
Medical device simulator for training healthcare professionals.
Stars: ✭ 27 (-43.75%)
Mutual labels:  medical
summarize-radiology-findings
Code and pretrained model for paper "Learning to Summarize Radiology Findings"
Stars: ✭ 63 (+31.25%)
Mutual labels:  medical-informatics
MICCAI21 MMQ
Multiple Meta-model Quantifying for Medical Visual Question Answering
Stars: ✭ 16 (-66.67%)
Mutual labels:  medical
fake-news-detection
This repo is a collection of AWESOME things about fake news detection, including papers, code, etc.
Stars: ✭ 34 (-29.17%)
Mutual labels:  text-classification
opentc
OpenTC is a text classification engine using several algorithms in machine learning
Stars: ✭ 27 (-43.75%)
Mutual labels:  text-classification
ebe-dataset
Evidence-based Explanation Dataset (AACL-IJCNLP 2020)
Stars: ✭ 16 (-66.67%)
Mutual labels:  text-classification
Kaggle-Twitter-Sentiment-Analysis
Kaggle Twitter Sentiment Analysis Competition
Stars: ✭ 18 (-62.5%)
Mutual labels:  text-classification
wolfpacs
WolfPACS is an DICOM load balancer written in Erlang.
Stars: ✭ 1 (-97.92%)
Mutual labels:  medical
support-tickets-classification
This case study shows how to create a model for text analysis and classification and deploy it as a web service in Azure cloud in order to automatically classify support tickets. This project is a proof of concept made by Microsoft (Commercial Software Engineering team) in collaboration with Endava http://endava.com/en
Stars: ✭ 142 (+195.83%)
Mutual labels:  text-classification
rdm
Our regulatory documentation manager. Streamlines 62304, 14971, and 510(k) documentation for software projects.
Stars: ✭ 71 (+47.92%)
Mutual labels:  medical
WeSTClass
[CIKM 2018] Weakly-Supervised Neural Text Classification
Stars: ✭ 67 (+39.58%)
Mutual labels:  text-classification

Text Understanding with the Tsetlin Machine

This project contains the source code for a text categorization approach based on the recently introduced Tsetlin Machine. We have included the CUDA implementation, and will later also include the C source code. This is the code that we used to produce the empirical results in the paper "Using the Tsetlin Machine to Learn Human-Interpretable Rules for High-Accuracy Text Categorization with Medical Applications" (https://arxiv.org/abs/1809.04547). We have also included a script that preprocesses the IMDB Movie Review dataset for analysis by the Tsetlin Machine, plus an anonymized version of the clinically derived dataset described and referred to in Section III.A of the revised paper (soon to be published).

Background

Although deep learning in the form of convolutional neural networks (CNN), recurrent neural networks (RNNs), and Long Short Term Memory (LSTM) recently has provided a leap ahead in text categorization accuracy, this leap has come at the expense of interpretability and computational complexity.

The Tsetlin automaton developed by M.L. Tsetlin in the Soviet Union in the late 1950s represents a fundamental and versatile learning mechanism; arguably even more so than the artificial neuron. While the Tsetlin Machine is able to form complex non-linear patterns, the propositional formulae it composes have also turned out to be particularly suitable for human interpretation.

Briefly, the Tsetlin Machine represents the terms of a text as propositional variables. From these, categories are captured using simple propositional formulae, such as: if “rash” and “reaction” and “penicillin” then Allergy. The Tsetlin Machine learns these formulae from a labelled text, utilizing conjunctive clauses to represent the particular facets of each category. Indeed, even the absence of terms (negated features) can be used for categorization purposes. The figure below is from our text categorization paper and captures the essence of this process (the paper explains the details of the figure). It reflects a running example from the medical domain, where we use the Tsetlin Machine to detect patient allergies in Electronic Health Records.

Figure 4

The combination of computational simplicity, accuracy, and results that are highly interpretable, leaves the Tsetlin Machine worthy of further exploration in many text analysis directions.

For a general introduction to the Tsetlin Machine, we refer the reader to the original paper (https://arxiv.org/pdf/1804.01508.pdf). For the details on the text analysis approach, see https://arxiv.org/abs/1809.04547,

Also notice that research on the Tsetlin Machine for text understanding is an on-going project at the Centre for Artificial Intelligence Research CAIR at the University of Agder, Norway. More information will be added as the project advances.

Requirements

Instructions for use

  1. Edit the GPUConfig.cuh file (adjust the GRID_SIZE and BLOCK_SIZE parameters for optimal performance of your Nvidia card).
  2. Use the following command to compile the necessary C++ CUDA executable: nvcc -O3 TsetlinMachineIMDB.cu TsetlinMachine.cu TsetlinMachineKernels.cu -lcurand -o TsetlinMachineIMDB.out (don't worry about the compiler warnings produced)
  3. Produce the IMDB dataset by running the command: python produce_dataset.py
  4. Run the TsetlinMachine executable produced in step 2 to start categorization of the IMDB dataset: ./TsetlinMachineIMDB.out
  5. The results for each epoch are calculated as true positive, true negative, false positive, false negative, accuracy, recall, and precision (produced on screen and also saved to the file statistics.txt).

Important meta-parameters used for tuning the Tsetlin Machine

You may experiment with the Tsetlin Machine's hyperparameter settings by editing the TsetlinMachineConfig.cuh and the TsetlinMachineIMDB.cu files. Just remember to recompile after editing before you try a rerun!

The most important parameters of the Tsetlin Machine that we adjust are features, clauses, classes, epochs, s, threshold T, and states. Features describes the number of features in the dataset, and must be correct for the adequate number of Tsetlin automata to be initialized by the Tsetlin Machine. Clauses describes the number of clauses that the Tsetlin Machine will implement to solve the text learning problem, and roughly translates to the number of hidden nodes in a neural network layer. Epochs is similar to epochs typically used for artificial neural network training, and describes the number of cycles the Tsetlin Machine runs on the training dataset. T describes a value for the threshold function regulating how easily the available clauses are spent representing each specific sub-pattern, while s denotes precision and determines the granularity of the sub-patterns captured by the literals in the Tsetlin Machine. States describe the number of states for each Tsetlin automaton.

Further Work

  • Optimize code base further.
  • Leverage the Convolutional Tsetlin Machine to further increase accuracy performance.

Citation

Please cite the relevant Tsetlin Machine papers if you use the Tsetlin Machine in your work:

@article{berge2019, 
  author={Geir Thore {Berge} and Ole-Christoffer {Granmo} and Tor Oddbjørn {Tveit} and Morten {Goodwin} and Lei {Jiao} and Bernt Viggo {Matheussen}}, 
  journal={IEEE Access}, 
  title="{Using the Tsetlin Machine to Learn Human-Interpretable Rules for High-Accuracy Text Categorization with Medical Applications}",
  volume={7},
  pages={115134-115146}, 
  year={2019}, 
  doi={10.1109/ACCESS.2019.2935416}, 
  ISSN={2169-3536}
}
@article{granmo2018tsetlin,
  author = {{Granmo}, Ole-Christoffer},
  title = "{The Tsetlin Machine - A Game Theoretic Bandit Driven Approach to Optimal Pattern Recognition with Propositional Logic}",
  journal={arXiv preprint arXiv:1804.01508},
  year={2018}
}

Licence

Copyright (c) 2019 Geir Thore Berge and Ole-Christoffer Granmo

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].