cahya-wirawan / opentc

Licence: MIT License
OpenTC is a text classification engine using several algorithms in machine learning

Open Text Classification (OpenTC)

OpenTC is a text classification engine based on machine learning. It is designed as a client-server architecture and uses the Python libraries scikit-learn and tensorflow for its machine learning algorithms. The following algorithms are currently supported:

  • Naive Bayes
  • Support Vector Machine
  • Convolutional Neural Network

In the future it will also support FastText from Facebookresearch.

The engine runs as a server that listens for commands and for text to be classified. By default it listens on localhost port 3333; this can be changed in the YAML configuration file.

OpenTC can be used for general text classification (a demo website for this purpose is available online: OpenTC demo) or for other purposes such as Data Leak Prevention (DLP). An example DLP implementation is available as an ICAP server: opentc-icap

Requirements

  • Python 3.x
  • numpy
  • pyparsing
  • PyYAML
  • scikit-learn
  • scipy
  • tensorflow 1.x

How to use

Installation

Install the module using pip:

$ pip install opentc

or clone the repository

$ git clone https://github.com/cahya-wirawan/opentc.git
$ cd opentc
$ python setup.py install

opentc

synopsis

opentc

Description

The command-line tool that trains a classifier on the datasets defined in the configuration file. The result of the training (pre-trained data) can then be used by the opentcd server.

Usage

$ python opentc -h
usage: opentc [-h] [-c CLASSIFIER] [-C CONFIGURATION_FILE] [-d DATASET]
              [-l LOG_CONFIGURATION_FILE]

optional arguments:
  -h, --help            show this help message and exit
  -c CLASSIFIER, --classifier CLASSIFIER
                        set classifier to use for the training (support
                        currently bayesian, svm or cnn)
  -C CONFIGURATION_FILE, --configuration_file CONFIGURATION_FILE
                        set the configuration file
  -d DATASET, --dataset DATASET
                        set dataset to use for the training
  -l LOG_CONFIGURATION_FILE, --log_configuration_file LOG_CONFIGURATION_FILE
                        set the log configuration file

opentcd

synopsis

opentcd

Description

The daemon listens for incoming connections on a TCP port (default is 3333) and classifies files or text strings on demand. It looks for a configuration file in the following order: ./opentc.yml, ~/.opentc/opentc.yml or /etc/opentc/opentc.yml.
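
The lookup order above can be pictured as a first-match search. The following is only a sketch of that idea, not opentcd's actual code; the helper name `find_config` and its `candidates` parameter are hypothetical.

```python
from pathlib import Path

# Search order taken from the description above.
DEFAULT_CANDIDATES = (
    "./opentc.yml",
    "~/.opentc/opentc.yml",
    "/etc/opentc/opentc.yml",
)


def find_config(candidates=DEFAULT_CANDIDATES):
    """Return the first existing configuration file as a Path, or None.

    Hypothetical helper: it only illustrates a first-match search.
    """
    for candidate in candidates:
        # expanduser() resolves the leading "~" to the home directory.
        path = Path(candidate).expanduser()
        if path.is_file():
            return path
    return None
```

Calling `find_config()` with no arguments walks the three default locations in order and stops at the first file that exists.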

Usage

Opentcd uses the configuration file opentc.yml to define almost all of its settings. Only a few of them can be overridden with command line options.

List of arguments:

$ python opentcd -h
usage: opentcd [-h] [-a ADDRESS] [-C CONFIGURATION_FILE]
               [-l LOG_CONFIGURATION_FILE] [-p PORT] [-t TIMEOUT]

optional arguments:
  -h, --help            show this help message and exit
  -a ADDRESS, --address ADDRESS
                        define the address for the server
  -C CONFIGURATION_FILE, --configuration_file CONFIGURATION_FILE
                        set the configuration file
  -l LOG_CONFIGURATION_FILE, --log_configuration_file LOG_CONFIGURATION_FILE
                        set the log configuration file
  -p PORT, --port PORT  define the port number which the server uses to listen
  -t TIMEOUT, --timeout TIMEOUT
                        define the time out

Run it as a background application:

$ python opentcd&
2017-05-02 13:33:22,276 - opentc.core.classifier.cnn_text - DEBUG - Load the checkpoint: data/input/cnn_twenty_newsgroup_20170301_090000-all/checkpoints/model-2210
INFO:tensorflow:Restoring parameters from data/input/cnn_twenty_newsgroup_20170301_090000-all/checkpoints/model-2210
2017-05-02 13:33:23,899 - tensorflow - INFO - Restoring parameters from data/input/cnn_twenty_newsgroup_20170301_090000-all/checkpoints/model-2210
2017-05-02 13:33:27,375 - __main__ - INFO - Server start
2017-05-02 13:33:28,019 - opentc.core.server - INFO - Server loop running in thread: Thread-1

datasets and pre-trained data

The configuration file defines the paths to the datasets and the pre-trained data. A pre-trained data set for testing purposes (around 1.4GB) can be downloaded from data. Uncompress it and change the path to the pre-trained data in the opentc.yml file accordingly.

Commands

Each command is terminated by a newline character. If opentcd doesn't recognize a command, or the command doesn't follow the requirements specified below, it replies with an error message but keeps waiting for further commands (this behaviour may change in the future).
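
A minimal client sketch for the newline-delimited exchange described above. It assumes that a command is sent as plain UTF-8 text ending in a newline and that the reply is a single newline-terminated line; the exact wire format is not documented here, so the helper name `send_command` and its framing are assumptions, not part of OpenTC's API.

```python
import socket


def send_command(command, host="localhost", port=3333, timeout=5.0):
    """Send one newline-terminated command and return the first reply line.

    Assumption: commands and replies are UTF-8 text lines.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # The newline marks the end of the command.
        sock.sendall(command.encode("utf-8") + b"\n")
        # makefile() provides line-oriented reading on top of the socket.
        with sock.makefile("r", encoding="utf-8", newline="\n") as reader:
            return reader.readline().rstrip("\n")
```

With a running opentcd on the default address, `send_command("PING")` would be a quick way to check that the server is reachable.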

PING

Check the server's state. It should reply with "PONG".

VERSION

Print the program version.

RELOAD

Reload the engine.

LIST_CLASSIFIER

List the supported classifiers (at the moment three are supported: Bayesian, Support Vector Machine and Convolutional Neural Network). It also shows the status of each classifier, either True (enabled) or False (disabled).

SET_CLASSIFIER

Enable or disable a specific classifier.

PREDICT_STREAM

Classify text streams. A newline character is used as the delimiter between sentences.

PREDICT_FILE

Classify a file. A newline character is used as the delimiter between sentences.

CLOSE

Close the connection.

Todo

  • Multilabel classification
  • Include FastText from Facebookresearch
  • Use pyzmq and Google's protobuf to improve the protocol and network communication
  • Consider a multiprocessing server instead of a multithreaded one, because the global interpreter lock prevents threads from running truly concurrently on multiprocessor systems