doodyparizada / Word2vec Spam Filter

Licence: mit
Using word vectors to classify spam messages

Programming languages: TypeScript, Stylus


word2vec-spam-filter

This project was built during the Kik hackathon 2017.

In this project we demonstrate a way to classify spam messages on the client while protecting user privacy.

The client generates a "hash" from the message and sends it to the server. The server then compares the "hash" against a bank of previously reported messages.

The bank of known reported messages is built from spam reports. When a message is reported, the server compares it against the messages already in the bank. If it is similar to a previously reported message, that message's report count is incremented. Otherwise the message is added to the bank with a count of 1.

A message in the bank of reported messages is considered spam once it has been reported more than 3 times.
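
The reporting flow described above can be sketched in Python roughly as follows. This is a minimal illustration and not the project's server code: the `ReportBank` class, its method names, and the 0.9 similarity default are assumptions made for the example. Messages are assumed to already be represented as vectors, and similarity is taken as the dot product of unit-length vectors.

```python
import numpy as np


class ReportBank:
    """Toy in-memory bank of reported message vectors (hypothetical, illustrative only)."""

    def __init__(self, similarity_threshold=0.9, spam_report_count=3):
        self.similarity_threshold = similarity_threshold  # assumed default
        self.spam_report_count = spam_report_count  # "more than 3 times"
        self.vectors = []  # unit-length vectors of reported messages
        self.counts = []   # report count for each stored vector

    @staticmethod
    def _normalize(vector):
        vector = np.asarray(vector, dtype=float)
        norm = np.linalg.norm(vector)
        return vector / norm if norm > 0 else vector

    def _closest(self, vector):
        """Return (index, similarity) of the most similar stored vector."""
        if not self.vectors:
            return None, 0.0
        similarities = np.stack(self.vectors) @ vector
        index = int(np.argmax(similarities))
        return index, float(similarities[index])

    def report(self, vector):
        """Record a spam report and return the updated report count."""
        vector = self._normalize(vector)
        index, similarity = self._closest(vector)
        if index is not None and similarity >= self.similarity_threshold:
            self.counts[index] += 1  # similar to a known report
            return self.counts[index]
        self.vectors.append(vector)  # new entry with a count of 1
        self.counts.append(1)
        return 1

    def is_spam(self, vector):
        """True if a similar message was reported more than spam_report_count times."""
        vector = self._normalize(vector)
        index, similarity = self._closest(vector)
        return (
            index is not None
            and similarity >= self.similarity_threshold
            and self.counts[index] > self.spam_report_count
        )
```

With this sketch, calling `report` on the same vector three times leaves `is_spam` false for that vector, and the fourth report flips it to true. A real server would also need persistent storage and a faster nearest-neighbour lookup than a full scan.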

Corpus downloads

We used 2 datasets for creating sentence vectors:

  1. word vectors taken from: https://github.com/stanfordnlp/GloVe
  2. word frequencies from: https://github.com/IlyaSemenov/wikipedia-word-frequency/blob/master/results/enwiki-20150602-words-frequency.txt
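
As a rough sketch of how these two datasets could be read, the following Python parses them from iterables of text lines. It assumes the GloVe text format (a word followed by space-separated float components) and a frequency file with one "word count" pair per line. The function names are hypothetical, and the README does not state which GloVe file was used.

```python
import numpy as np


def load_word_vectors(lines):
    """Parse GloVe-style text lines: a word followed by its float components."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.array([float(x) for x in parts[1:]])
    return vectors


def load_word_frequencies(lines):
    """Parse 'word count' lines into relative frequencies that sum to 1."""
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        counts[parts[0]] = int(parts[1])
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}
```

Both functions accept any iterable of strings, so an open file handle (opened with `encoding="utf-8"`) can be passed directly.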

Configurable parameters (Hyper-Parameters)

We played around with a few configurations to get the best results for short user messages:

  • Confidence Threshold - a number between 0.0 and 1.0 that determines when 2 messages are considered the same
  • Distance Function - we used the vector dot product
  • Normalization - how to deal with words we don't have in our corpus, punctuation marks, and non-English words
  • Vector Size - the longer the vector, the higher the accuracy, but the heavier the memory use
  • Weight Function - given a word frequency, how to create the vector weights ("the" should weigh less than "camera")
  • Custom Corpus - creating the word vectors and frequencies from real user message data might yield better results
  • Random Indices - how many random indices the client should send to the server to mask the original message indices
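
To make a few of these parameters concrete, here is a minimal Python sketch of a weight function, a sentence vector, and a thresholded dot-product comparison. The README does not say which weight function or normalization rules were used. The `a / (a + frequency)` weighting, the letters-only tokenizer, the 0.9 threshold, and all function names below are assumptions chosen for illustration.

```python
import re

import numpy as np


def tokenize(message):
    """Lowercase and keep runs of a-z only (one simple normalization choice)."""
    return re.findall(r"[a-z]+", message.lower())


def word_weight(frequency, a=1e-3):
    """Down-weight frequent words: weight = a / (a + frequency)."""
    return a / (a + frequency)


def sentence_vector(message, word_vectors, word_frequencies, a=1e-3):
    """Weighted sum of known word vectors, scaled to unit length."""
    size = len(next(iter(word_vectors.values())))
    total = np.zeros(size)
    for token in tokenize(message):
        if token not in word_vectors:
            continue  # words missing from the corpus are dropped
        weight = word_weight(word_frequencies.get(token, 0.0), a)
        total += weight * np.asarray(word_vectors[token], dtype=float)
    norm = np.linalg.norm(total)
    return total / norm if norm > 0 else total


def is_same_message(vec_a, vec_b, confidence_threshold=0.9):
    """Compare the dot product of two unit vectors against the threshold."""
    return float(np.dot(vec_a, vec_b)) >= confidence_threshold
```

With relative frequencies such as 0.05 for "the" and 0.0001 for "camera", `word_weight` returns a much smaller weight for "the", so the sentence vector of "the camera" points almost entirely in the direction of "camera". Because the vectors are scaled to unit length, the dot product equals cosine similarity and stays within the 0.0 to 1.0 range of the confidence threshold for messages that point in similar directions.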

Running the code

This project includes a single makefile to help with the initialization, dependency installation and corpus download. You can invoke a help message by running:

make

Or you can manually run the server and client apps:

server

In the server directory, install the pip dependencies inside a virtualenv:

pip install -r requirements.txt

and run the server:

python app.py

web client

To use the web client go into the webclient directory in your terminal and then:

npm install
npm run dev

That should install all dependencies and start the project. If everything works, you should see something like:

Project is running at http://localhost:3333/
webpack output is served from /

Now load http://localhost:3333/ in your browser.

There are 3 different "view modes" which can be switched using the select box at the top right corner of the page.
The 3 views are:

  • Standalone Tester: A textarea in which one can input a message and then either report it as spam or check whether it is classified as spam.
  • IM Sender: A textarea in which the user can input a message (or select a message from a bunch of existing ones) and then "send" the message to another client.
  • IM Receiver: A view which displays a list of received messages (using the IM Sender) and the ability to report each message as spam.