
NLP - News classification

Train and deploy a news classifier based on ULMFit.

Running on cloud/local machine

To run the application, we can use the pre-built Docker image available on Docker Hub and simply run the following command:

docker run --rm -p 8080:8080 imadelh/news:v1

The application will be available at http://0.0.0.0:8080. To specify the number of workers or provide an HTTPS certificate, run a customized Gunicorn command inside the container:

# Get into the container
docker run -it --rm -v ~/nlp:/cert -p 8080:8080 imadelh/news:v1 bash

# Run Gunicorn with a specific number of workers/threads and HTTPS certificates
gunicorn --certfile '/path_to/chain.pem' --keyfile '/path_to/key.pem' --workers=4 --bind 0.0.0.0:8080 wsgi:app
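Gunicorn's `wsgi:app` argument refers to a WSGI callable named `app` exposed by a `wsgi.py` module inside the image. As a hedged illustration of what that entry point looks like (the project's real app presumably wraps the classifier in a web framework; this stand-in uses only the stdlib WSGI contract):

```python
# wsgi.py - minimal stand-in for the WSGI entry point (illustrative only;
# the real application serves the news classifier, likely via a framework).
def app(environ, start_response):
    # A WSGI callable receives the request environ and a start_response hook.
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    # The body must be an iterable of bytes.
    return [b"news classifier placeholder\n"]
```

Gunicorn would serve this with `gunicorn --bind 0.0.0.0:8080 wsgi:app`.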

Serverless deployment - Google Cloud Run

Cloud Run is a GCP service that allows serverless deployment of containers behind HTTPS endpoints. The app will run on 1 CPU with 2 GB of memory and will scale automatically with the number of concurrent requests.

  • Build image and push it to Container Registry

From a GCP project, we will use Cloud Shell to build the image and push it to the Container Registry (GCR).

# Get name of project 
# For illustration we will call it PROJECT-ID

gcloud config get-value project

Create the following Dockerfile in your Cloud Shell session.

FROM imadelh/news:v_1cpu

# Cloud Run provides the PORT env variable

CMD gunicorn --bind :$PORT wsgi:app

Finally, we can build and submit the image to GCR.

gcloud builds submit --tag gcr.io/PROJECT-ID/news_classifier

  • Deploy on Cloud Run

From the Cloud Run page, we will use the image gcr.io/PROJECT-ID/news_classifier:latest to run the app. Create a new service, enter the address of the image, choose the remaining parameters (region, memory, maximum number of instances), and deploy.

After a few seconds, you will see a link to the app.

The serverless version may suffer from cold starts if the service receives no requests for a long time.

Reproduce results

LR and SVM

  • Requirements

To reproduce the results reported in the blog post, we need to install the requirements in our development environment.

# Open requirements.txt and select torch==1.1.0 instead of the CPU-only version used for inference.
# Then install the requirements
pip install -r requirements.txt

  • Hyper-parameter search

After completing the installation, we can run the hyper-parameter search or train the sklearn models as follows:

# Params search for SVM
cd sklearn_models
python3 params_search.py --model svc --exp_name svmsearch_all --data dataset_processed

# Params search for LR
python3 params_search.py --model lreg --exp_name logreg_all --data dataset_processed

The parameter space is defined in the file sklearn_models/params_search.py. The outputs are saved in the logs folder.
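As an illustration of this kind of search, here is a hedged sketch using scikit-learn's GridSearchCV; the pipeline, parameter names, and values below are assumptions for the example, not the project's actual grid:

```python
# Illustrative hyper-parameter search over a TF-IDF + linear-SVM pipeline;
# the project's real search space lives in sklearn_models/params_search.py.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

texts = [
    "stocks rally as markets open", "team wins the championship game",
    "new phone released this week", "election results announced today",
    "quarterly earnings beat forecasts", "striker scores twice in final",
    "chip maker unveils faster processor", "senate passes the new bill",
]
labels = ["business", "sport", "tech", "politics"] * 2

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])

# Small illustrative grid: TF-IDF n-gram range and the SVM's C parameter,
# addressed through the pipeline step prefixes.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)
print(search.best_params_)
```

The real script presumably logs `best_params_` and the cross-validation scores to the logs folder.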

  • Training

Training a model with a fixed set of parameters can be done using sklearn_models/baseline.py:

# Specify the parameters of the model inside baseline.py and run
python3 baseline.py --model svc --exp_name svc_all --data dataset_processed

The logs/metrics on the test dataset will be saved in sklearn_models/logs/ and the trained model in sklearn_models/saved_models/.
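The train-then-persist step can be sketched as follows; the TF-IDF + SVM pipeline, file name, and save path here are assumptions, with the actual logic in sklearn_models/baseline.py:

```python
# Illustrative baseline: train a model with fixed hyper-parameters and
# persist it, mirroring what sklearn_models/baseline.py does (details assumed).
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

texts = ["markets fall sharply", "goalkeeper saves penalty",
         "browser update fixes bugs", "prime minister gives speech"]
labels = ["business", "sport", "tech", "politics"]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC(C=1.0)),  # a fixed parameter set, as in baseline.py
])
model.fit(texts, labels)

# Persist the trained pipeline; the real script writes to saved_models/.
path = os.path.join(tempfile.gettempdir(), "svc_all.joblib")
joblib.dump(model, path)
restored = joblib.load(path)
print(restored.predict(["markets fall sharply"]))
```

Loading the saved pipeline restores both the vectorizer vocabulary and the classifier weights, so inference needs no retraining.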

ULMFit

To reproduce or train the ULMFit model, use the notebooks available in ulmfit/. The same requirements as above are needed. A GPU is required to fine-tune the language model; this can be done on Google Colab.
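ULMFit fine-tuning uses the slanted triangular learning-rate schedule from Howard & Ruder (2018): the rate rises linearly for a short fraction of training, then decays linearly. A stdlib sketch of that schedule (the default values below are illustrative, not the notebooks' settings):

```python
# Slanted triangular learning rate (Howard & Ruder, 2018).
# The rate rises linearly for the first cut_frac of steps, then decays linearly.
import math

def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    cut = math.floor(total_steps * cut_frac)
    if t < cut:
        p = t / cut                                  # warm-up phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # decay phase
    return lr_max * (1 + p * (ratio - 1)) / ratio

# The peak lr_max is reached at step cut; both ends sit at lr_max / ratio.
```

In practice fastai applies this schedule for you during `fit`, combined with discriminative learning rates per layer group.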

  • Notebook contents:

    • Data preparation
    • Fine-tuning ULMFit
    • Training the ULMFit classifier
    • Predictions and evaluation
    • Exporting the trained model
    • Inference on CPU

To run the training, we need to specify the path to the folder where the training data is stored.

  • Locally:

Save the data from data/, then specify the absolute PATH at the beginning of the notebook.

# This is the absolute path to the folder where "data" is available
PATH = "/app/analyse/"

  • Google Colab:

Save the data in a Google Drive folder, for example files/nlp/.

# The folder 'data' is saved in Google Drive under "files/nlp/"
# While running the notebook from Google Colab, mount the drive and define PATH to the data
from google.colab import drive
drive.mount('/content/gdrive/')

# Then give the path where your data is stored (in Google Drive)
PATH = "/content/gdrive/My Drive/files/nlp/"

01_ulmfit_balanced_dataset.ipynb - Train ULMFit on the balanced dataset

02_ulmfit_all_data.ipynb - Train ULMFit on the full dataset

Performance

Performance of ULMFit on the test dataset data/dataset_inference (see the end of 02_ulmfit_all_data.ipynb for the definition of the test dataset).

# ULMFit - Performance on test dataset
              precision    recall  f1-score   support
micro avg                               0.73     20086
macro avg          0.66      0.61      0.63     20086
weighted avg       0.72      0.73      0.72     20086

Top-3 accuracy on the test dataset:
0.9044
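Top-3 accuracy counts a prediction as correct when the true label is among the three highest-scoring classes. A stdlib sketch of the metric (the notebook presumably computes it from the model's per-class scores):

```python
# Top-k accuracy: fraction of samples whose true label is among the
# k highest-scoring classes.
def top_k_accuracy(scores, true_labels, k=3):
    hits = 0
    for row, truth in zip(scores, true_labels):
        # Indices of the k largest scores in this row.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += truth in top_k
    return hits / len(true_labels)

# Example with 4 classes: the true class 2 is ranked 2nd, so it counts as a hit.
print(top_k_accuracy([[0.1, 0.5, 0.3, 0.1]], [2], k=3))  # 1.0
```

With k=1 this reduces to ordinary accuracy, which is why top-3 (0.9044) is well above the 0.73 micro-average above.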

The trained model is available for download at: https://github.com/imadelh/NLP-news-classification/releases/download/v1.0/ulmfit_model

This project is a very basic text classifier. Here is a list of other features that could be added:

  • A feedback option to let the user submit a correction to the prediction.
  • Periodic fine-tuning of the model on new feedback.
  • A comparison against other language models (BERT, XLNet, etc.).

Imad El Hanafi
