RandyPen / Textcluster

Licence: bsd-3-clause

Short text clustering preprocessing module

Labels

nlp cluster text-mining text-processing

Projects that are alternatives of or similar to Textcluster

parsing fixed width files content made easy

Stars: ✭ 12 (-89.57%)

Mutual labels: text-mining, text-processing

TextDatasetCleaner

🔬 Cleaning junk out of datasets (normalization, preprocessing)

Stars: ✭ 27 (-76.52%)

Mutual labels: text-mining, text-processing

An open-source Python library for natural language processing and text analysis.

Stars: ✭ 91 (-20.87%)

Mutual labels: text-mining, text-processing

A keyphrase extractor for Persian

Stars: ✭ 60 (-47.83%)

Mutual labels: text-mining, text-processing

Artificial Adversary

🗣️ Tool to generate adversarial text examples and test machine learning models against them

Stars: ✭ 348 (+202.61%)

Mutual labels: text-mining, text-processing

Text-Classification-LSTMs-PyTorch

The aim of this repository is to show a baseline model for text classification by implementing a LSTM-based model coded in PyTorch. In order to provide a better understanding of the model, it will be used a Tweets dataset provided by Kaggle.

Stars: ✭ 45 (-60.87%)

Mutual labels: text-mining, text-processing

An easy-to-use library to extract indices from texts.

Stars: ✭ 18 (-84.35%)

Mutual labels: text-mining, text-processing

Deduce: de-identification method for Dutch medical text

Stars: ✭ 40 (-65.22%)

Mutual labels: text-mining, text-processing

Explaining textual analysis tools in Python. Including Preprocessing, Skip Gram (word2vec), and Topic Modelling.

Stars: ✭ 48 (-58.26%)

Mutual labels: text-mining, text-processing

support-tickets-classification

This case study shows how to create a model for text analysis and classification and deploy it as a web service in Azure cloud in order to automatically classify support tickets. This project is a proof of concept made by Microsoft (Commercial Software Engineering team) in collaboration with Endava http://endava.com/en

Stars: ✭ 142 (+23.48%)

Mutual labels: text-mining, text-processing

Weaving analytical stories from text data

Stars: ✭ 12 (-89.57%)

Mutual labels: text-mining, text-processing

PipeIt is a text transformation, conversion, cleansing and extraction tool.

Stars: ✭ 57 (-50.43%)

Mutual labels: text-mining, text-processing

Extract indicators of compromise from text, including "escaped" ones.

Stars: ✭ 148 (+28.7%)

Mutual labels: text-mining, text-processing

corpusexplorer2.0

Corpus linguistics has never been so easy...

Stars: ✭ 16 (-86.09%)

Mutual labels: text-mining, text-processing

CogComp's light-weight Python NLP annotators

Stars: ✭ 115 (+0%)

Mutual labels: text-mining, text-processing

advanced-text-mining

Covers natural language processing and text analysis methodology using the TEANAPS library.

Stars: ✭ 15 (-86.96%)

Mutual labels: text-mining, text-processing

Text Mining in Python

Stars: ✭ 18 (-84.35%)

Mutual labels: text-mining, text-processing

Applied Text Mining In Python

Repo for Applied Text Mining in Python (coursera) by University of Michigan

Stars: ✭ 59 (-48.7%)

Mutual labels: text-mining, text-processing

Postgres high-availability cluster with auto-failover and automated cluster recovery.

Stars: ✭ 1,360 (+1082.61%)

Mutual labels: cluster

my tools working with redis

Stars: ✭ 104 (-9.57%)

Mutual labels: cluster

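Short Text Clustering

Introduction

Short text clustering is a common text preprocessing step. It can be used to reveal common patterns in text, to analyze and design semantic parsing specifications, and to speed up similar-sentence lookup. This project implements a memory-friendly short text clustering method and provides an interface for querying similar sentences.

Requirements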

pip install tqdm jieba

Usage

Clustering

python cluster.py --infile ./data/infile \
--output ./data/output

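For the full list of parameters, see the argument descriptions in the _get_parser() function in cluster.py. They include the tokenizer dictionary, stop words, number of matching samples, and matching threshold.

Search

See how the Searcher class is used in search.py. When querying annotated data, join each sentence and its annotation with the separator :::, for example 我是海贼王:::(λx.海贼王). Only the sentence part is matched during processing.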
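The separator convention can be illustrated with a short Python sketch. The helper name split_annotated is hypothetical and is not part of search.py; it only shows how a line of the form sentence:::annotation could be split so that matching runs on the sentence alone.

```python
SEPARATOR = ":::"  # separator named in the README


def split_annotated(line):
    """Split 'sentence:::annotation' into (sentence, annotation).

    Hypothetical helper for illustration; lines without the separator
    are returned as (line, None).
    """
    sentence, sep, annotation = line.rstrip("\n").partition(SEPARATOR)
    return sentence, (annotation if sep else None)


sentence, annotation = split_annotated("我是海贼王:::(λx.海贼王)")
print(sentence)    # only this part would be used for matching
print(annotation)  # carried along untouched
```

Using str.partition splits on the first occurrence only, so an annotation that itself contains ::: stays intact.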

Basic Idea

File Structure

TextCluster
|      README.md
|      LICENSE
|      cluster.py                    clustering program
|      search.py                     search program
|      
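|------utils                         shared utility modules
|    |    __init__.py
|    |    segmentor.py               tokenizer wrapper
|    |    similar.py                 similarity functions
|    |    utils.py                   file handling module
|
|------data
|    |    infile                     default input text path, for testing Chinese mode
|    |    infile_en                  default input text path, for testing English mode
|    |    seg_dict                   default tokenizer dictionary
|    |    stop_words                 default stop words path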

Note: this method targets short texts only. For long-text clustering, choose SimHash, LDA, or another algorithm as needed.

Text Cluster

Introduction

Text clustering is a common preprocessing step for analyzing text features. This project implements a memory-friendly clustering method intended for short texts only. For long texts, SimHash, LDA, or another algorithm is a better choice, depending on your needs.

Requirements

pip install tqdm spacy

Usage

Clustering

python cluster.py --infile ./data/infile_en \
--output ./data/output \
--lang en

For descriptions of the other configuration arguments, such as the stop-word settings and the sample number, see _get_parser() in cluster.py.
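To make the roles of the stop words, sample number, and similarity threshold concrete, here is a minimal Python sketch of greedy threshold-based clustering. It is an assumption made for illustration and not necessarily how cluster.py works: the function names, the whitespace tokenization, the Jaccard measure, and the default values are all hypothetical.

```python
import random


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(sentences, threshold=0.5, sample_number=5,
            stop_words=frozenset(), seed=0):
    """Greedy one-pass clustering (illustrative, not the repo's code).

    Each sentence is compared against at most `sample_number` random
    members of every existing bucket and joins the best-scoring bucket
    if the average similarity reaches `threshold`; otherwise it starts
    a new bucket.
    """
    rng = random.Random(seed)
    buckets = []  # each bucket is a list of (sentence, token_set)
    for sentence in sentences:
        tokens = {t for t in sentence.lower().split() if t not in stop_words}
        best_bucket, best_score = None, 0.0
        for bucket in buckets:
            sample = rng.sample(bucket, min(sample_number, len(bucket)))
            score = sum(jaccard(tokens, s) for _, s in sample) / len(sample)
            if score > best_score:
                best_bucket, best_score = bucket, score
        if best_bucket is not None and best_score >= threshold:
            best_bucket.append((sentence, tokens))
        else:
            buckets.append([(sentence, tokens)])
    return [[s for s, _ in bucket] for bucket in buckets]


groups = cluster(
    ["book a flight to Paris", "book a flight to Rome",
     "what is the weather today", "what is the weather tomorrow"],
    stop_words={"a", "to", "the", "is"},
)
for group in groups:
    print(group)
```

Sampling a fixed number of members per bucket keeps the cost of each comparison bounded no matter how large a bucket grows, which is one way such a method can stay memory and time friendly.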

Search

Basic Idea

File Structure

TextCluster
|      README.md
|      LICENSE
|      cluster.py                    clustering function
|      search.py                     search function
|      
|------utils                         utilities
|    |    __init__.py
|    |    segmentor.py               tokenizer wrapper
|    |    similar.py                 similarity calculator
|    |    utils.py                   file process module
|
|------data
|    |    infile                     default input file path, to test Chinese mode
|    |    infile_en                  default input file path, to test English mode
|    |    seg_dict                   default tokenizer dict path
|    |    stop_words                 default stop words path

Other Languages

To support another language, modify the tokenizer wrapper in ./utils/segmentor.py.
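As a rough idea of what a replacement tokenizer might look like, the Python sketch below splits text on word characters and drops stop words. The class name RegexSegmentor and the cut method are hypothetical; the actual interface expected by cluster.py is defined in ./utils/segmentor.py and should be checked there before adapting this.

```python
import re


class RegexSegmentor:
    """Hypothetical tokenizer wrapper for whitespace-delimited languages.

    Not the interface used by utils/segmentor.py; shown only to
    illustrate the kind of change a new language might need.
    """

    def __init__(self, stop_words=()):
        self.stop_words = set(stop_words)
        self._pattern = re.compile(r"\w+")  # runs of Unicode word characters

    def cut(self, text):
        """Return lowercase tokens with stop words removed."""
        return [t for t in self._pattern.findall(text.lower())
                if t not in self.stop_words]


segmentor = RegexSegmentor(stop_words=["le"])
print(segmentor.cut("Le chat dort."))
```

Languages without explicit word boundaries would need a dedicated tokenizer in place of the regular expression.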

Stars: ✭ 115
