
ankane / youtokentome-ruby

License: MIT License
High performance unsupervised text tokenization for Ruby

Programming Languages

Ruby
36898 projects - #4 most used programming language
C++
36643 projects - #6 most used programming language
Python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to youtokentome-ruby

esapp
An unsupervised Chinese word segmentation tool.
Stars: ✭ 13 (-23.53%)
Mutual labels:  unsupervised-learning, word-segmentation
deep-INFOMAX
Chainer implementation of deep-INFOMAX
Stars: ✭ 32 (+88.24%)
Mutual labels:  unsupervised-learning
LinearCorex
Fast, linear version of CorEx for covariance estimation, dimensionality reduction, and subspace clustering with very under-sampled, high-dimensional data
Stars: ✭ 39 (+129.41%)
Mutual labels:  unsupervised-learning
Indoor-SfMLearner
[ECCV'20] Patch-match and Plane-regularization for Unsupervised Indoor Depth Estimation
Stars: ✭ 115 (+576.47%)
Mutual labels:  unsupervised-learning
DRNET
PyTorch implementation of the NIPS 2017 paper - Unsupervised Learning of Disentangled Representations from Video
Stars: ✭ 45 (+164.71%)
Mutual labels:  unsupervised-learning
KD3A
The official implementation of the model KD3A from the paper "KD3A: Unsupervised Multi-Source Decentralized Domain Adaptation via Knowledge Distillation".
Stars: ✭ 63 (+270.59%)
Mutual labels:  unsupervised-learning
bert tokenization for java
A Java version of the Chinese tokenization described in BERT.
Stars: ✭ 39 (+129.41%)
Mutual labels:  tokenization
treecut
Find nodes in hierarchical clustering that are statistically significant
Stars: ✭ 26 (+52.94%)
Mutual labels:  unsupervised-learning
BaySMM
Model for learning document embeddings along with their uncertainties
Stars: ✭ 25 (+47.06%)
Mutual labels:  unsupervised-learning
SymSpellCppPy
Fast SymSpell written in C++ and exposed to Python via pybind11
Stars: ✭ 28 (+64.71%)
Mutual labels:  word-segmentation
Discovery
Mining Discourse Markers for Unsupervised Sentence Representation Learning
Stars: ✭ 48 (+182.35%)
Mutual labels:  unsupervised-learning
catgan pytorch
Unsupervised and Semi-supervised Learning with Categorical Generative Adversarial Networks
Stars: ✭ 50 (+194.12%)
Mutual labels:  unsupervised-learning
auto-data-tokenize
Identify and tokenize sensitive data automatically using Cloud DLP and Dataflow
Stars: ✭ 21 (+23.53%)
Mutual labels:  tokenization
machine-learning-course
Machine Learning Course @ Santa Clara University
Stars: ✭ 17 (+0%)
Mutual labels:  unsupervised-learning
UETsegmenter
A toolkit for Vietnamese word segmentation
Stars: ✭ 60 (+252.94%)
Mutual labels:  word-segmentation
simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Stars: ✭ 32 (+88.24%)
Mutual labels:  tokenization
customized-symspell
Java port of SymSpell: 1 million times faster through Symmetric Delete spelling correction algorithm
Stars: ✭ 51 (+200%)
Mutual labels:  word-segmentation
NMFADMM
A sparsity aware implementation of "Alternating Direction Method of Multipliers for Non-Negative Matrix Factorization with the Beta-Divergence" (ICASSP 2014).
Stars: ✭ 39 (+129.41%)
Mutual labels:  unsupervised-learning
CAIL2018-toy
The final team project for a data mining course, based on the CAIL-2018 competition. NOTE: this is quite simple and trivial code.
Stars: ✭ 23 (+35.29%)
Mutual labels:  nlp
al-fk-self-supervision
Official PyTorch code for CVPR 2020 paper "Deep Active Learning for Biased Datasets via Fisher Kernel Self-Supervision"
Stars: ✭ 28 (+64.71%)
Mutual labels:  unsupervised-learning

YouTokenToMe Ruby

YouTokenToMe - high performance unsupervised text tokenization - for Ruby

Learn more about how it works


Installation

Add this line to your application’s Gemfile:

gem "youtokentome"
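
Then run bundle install to install the gem.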

Getting Started

Dump your text to a file and enjoy blazingly fast tokenization!
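
For example, you could write a toy corpus like this (the contents of train.txt are purely illustrative):

# write a small training corpus, one sentence per line (contents are illustrative)
File.write("train.txt", "the quick brown fox\njumps over the lazy dog\n")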

Train a model

model = YouTokenToMe::BPE.train(data: "train.txt", model: "model.txt", vocab_size: 30000)

Load a model

model = YouTokenToMe::BPE.new("model.txt")

Get vocab

model.vocab
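
The vocab is the list of learned subwords, so you can inspect it like any Ruby array. A minimal sketch, assuming vocab returns an array of strings (the tokens shown are illustrative):

model.vocab.first(5)  # => e.g. ["<PAD>", "<UNK>", "<BOS>", "<EOS>", "▁the"]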

Encode

model.encode(sentences)

Decode

model.decode(ids)
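
Putting encode and decode together, a round trip looks roughly like this (the token ids shown are illustrative, not actual output):

ids = model.encode(["the quick brown fox"])  # => e.g. [[12, 57, 103, 8]]
model.decode(ids)                            # => ["the quick brown fox"]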

Convert between ids and subwords

model.subword_to_id(subword)
model.id_to_subword(id)
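
For instance, converting a single subword to its id and back (the subword and id here are illustrative, assuming "fox" is in the trained vocabulary):

id = model.subword_to_id("fox")  # => e.g. 8
model.id_to_subword(id)          # => "fox"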

Options

Train

YouTokenToMe::BPE.train(
  data: "train.txt",   # path to file with training data
  model: "model.txt",  # path to where the trained model will be saved
  vocab_size: 30000,   # number of tokens in the final vocabulary
  coverage: 1.0,       # fraction of characters covered by the model
  n_threads: -1,       # number of parallel threads used to run
  pad_id: 0,           # reserved id for padding
  unk_id: 1,           # reserved id for unknown symbols
  bos_id: 2,           # reserved id for beginning-of-sentence token
  eos_id: 3            # reserved id for end-of-sentence token
)
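
Assuming the values shown above are the defaults, you only need to override the options you care about. A sketch with illustrative file names and values:

model = YouTokenToMe::BPE.train(
  data: "train.txt",
  model: "model.txt",
  vocab_size: 5000,    # smaller vocabulary for a small corpus
  n_threads: 2         # cap parallelism instead of using all available cores
)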

Encode

model.encode(
  sentences,
  output_type: :id,    # or :subword
  bos: false,          # add "beginning of sentence" token
  eos: false,          # add "end of sentence" token
  reverse: false,      # reverse output sequence of tokens
  dropout_prob: 0.0    # BPE-dropout probability
)
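
For example, to get subword strings instead of ids, with sentence boundary tokens added (output is illustrative; YouTokenToMe marks word starts with the ▁ character):

model.encode(["the fox"], output_type: :subword, bos: true, eos: true)
# => e.g. [["<BOS>", "▁the", "▁fox", "<EOS>"]]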

History

View the changelog

Contributing

Everyone is encouraged to help improve this project. Here are a few ways you can help:

- Report bugs
- Fix bugs and submit pull requests
- Write, clarify, or fix documentation
- Suggest or add new features

To get started with development:

git clone https://github.com/ankane/youtokentome-ruby.git
cd youtokentome-ruby
bundle install
bundle exec rake compile
bundle exec rake test