
howl-anderson / Hanzi_char_featurizer

Licence: apache-2.0
A Chinese character feature extractor (featurizer) that extracts pronunciation and glyph features of Chinese characters for use as input features in deep learning.



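Chinese Character Featurizer

Many deep learning applications need features extracted from Chinese characters, such as pronunciation features and glyph features. This project provides a general-purpose character feature extraction framework, with built-in features for pinyin, glyph shape (four-corner code), and radical decomposition.

Featurizers

  • Pinyin featurizer: uses the pinyin of a character as its feature, so that characters with similar pronunciations should have similar encodings. Example: -> ->
  • Glyph (four-corner code) featurizer: uses the visual shape of a character as its feature, so that similar-looking characters should have similar encodings. Example: -> 37001 -> 37101
  • Radical decomposition featurizer: uses the decomposition of a character into radicals and components as its feature, so that similar characters should have similar encodings. Example: -> ['门', '一'] -> ['门', '三']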
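
As a rough illustration of what "similar characters should have similar encodings" means for the examples in the list above, the sketch below compares the two four-corner codes and the two radical decompositions shown there. The helper names `digit_overlap` and `jaccard` are hypothetical and are not part of hanzi_char_featurizer; they show only one possible way to measure how close two encodings are.

```python
def digit_overlap(code_a: str, code_b: str) -> float:
    """Fraction of positions at which two equal-length codes agree."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    matches = sum(a == b for a, b in zip(code_a, code_b))
    return matches / len(code_a)


def jaccard(parts_a, parts_b) -> float:
    """Jaccard similarity between two collections of components."""
    set_a, set_b = set(parts_a), set(parts_b)
    union = set_a | set_b
    if not union:
        return 1.0  # two empty decompositions are treated as identical
    return len(set_a & set_b) / len(union)


# The two four-corner codes from the list differ in one of five digits.
four_corner_sim = digit_overlap("37001", "37101")

# The two radical decompositions share one component out of three distinct ones.
radical_sim = jaccard(["门", "一"], ["门", "三"])

print(four_corner_sim)
print(radical_sim)
```

Both measures treat the encodings as plain symbols. A model trained on these features may learn a different notion of similarity.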

Usage

from hanzi_char_featurizer import Featurizor

# Create a featurizer and extract the features of a two-character string.
featurizor = Featurizor()
result = featurizor.featurize('明天')
print(result)

Output

([['m'], ['t']], [['ing'], ['ian']], [['2'], ['1']], ('6', '1'), ('7', '0'), ('0', '8'), ('2', '0'), ('0', '4'))

Output structure
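
Reading the output above, the tuple appears to contain eight per-feature sequences, each with one entry per input character: pinyin initials, pinyin finals, tones, and then five positions that look like the digits of the four-corner code. That interpretation is an assumption drawn from the sample output, not from the library's documentation. The sketch below works on a literal copy of the output and regroups it per character; `regroup` is a hypothetical helper, not part of hanzi_char_featurizer.

```python
# Literal copy of the output shown above for featurize('明天').
result = (
    [['m'], ['t']], [['ing'], ['ian']], [['2'], ['1']],
    ('6', '1'), ('7', '0'), ('0', '8'), ('2', '0'), ('0', '4'),
)


def regroup(result):
    """Turn per-feature sequences into one dict per character.

    Assumes the layout (initials, finals, tones, digit_1 .. digit_n).
    """
    initials, finals, tones, *corner_digits = result
    per_char = []
    for i in range(len(initials)):
        per_char.append({
            "initial": initials[i][0],
            "final": finals[i][0],
            "tone": tones[i][0],
            # Join the i-th entry of every digit position into one code.
            "four_corner": "".join(position[i] for position in corner_digits),
        })
    return per_char


features = regroup(result)
for entry in features:
    print(entry)
```

Under this reading, the first character is described by the initial `m`, the final `ing`, tone `2`, and the code `67020`.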

Output to TensorFlow as a Tensor

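# Written against the TensorFlow 1.x API: tf.Session is not available
# at the top level in TensorFlow 2.x.
import tensorflow as tf

import hanzi_char_featurizer

# Build the feature tensor from the input text file.
feature = hanzi_char_featurizer.featurize_as_tensor('./usage/data.txt')

with tf.Session() as sess:
    # Lookup tables must be initialized before the tensor is evaluated.
    sess.run(tf.initializers.tables_initializer())
    for _ in range(1):
        print('+' * 20)
        data = sess.run(feature)
        print(data)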

Output

++++++++++++++++++++
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
  1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
  0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]
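
The row above contains eight entries set to 1, which would be consistent with one one-hot segment per feature (initial, final, tone, and five code digits) concatenated into a single multi-hot vector. That is an inference from the printed output, not a description of the library's actual encoding. The sketch below shows the general technique with deliberately tiny, hypothetical vocabularies; the real vocabularies and segment order used by hanzi_char_featurizer are not shown in this README.

```python
def one_hot(value, vocabulary):
    """Return a one-hot list with a 1.0 at the index of `value`."""
    vector = [0.0] * len(vocabulary)
    vector[vocabulary.index(value)] = 1.0
    return vector


def multi_hot(features, vocabularies):
    """Concatenate one one-hot segment per feature into a single vector."""
    vector = []
    for value, vocabulary in zip(features, vocabularies):
        vector.extend(one_hot(value, vocabulary))
    return vector


# Hypothetical vocabularies, far smaller than any real pinyin inventory.
INITIALS = ["b", "m", "t"]
FINALS = ["ian", "ing"]
TONES = ["1", "2", "3", "4", "5"]
DIGITS = [str(d) for d in range(10)]
vocabularies = [INITIALS, FINALS, TONES] + [DIGITS] * 5

# Eight features for one character: initial, final, tone, five code digits.
char_features = ["m", "ing", "2", "6", "7", "0", "2", "0"]
vector = multi_hot(char_features, vocabularies)

print(len(vector))
print(sum(vector))
```

With eight features, the resulting vector always has exactly eight non-zero entries, whatever the vocabulary sizes are.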

Companies using hanzi_char_featurizer



TODO

  • Add a Unicode IDS (Ideographic Description Sequence) representation, as used in iQIYI's FASPell model