
ankane / youtokentome-ruby

License: MIT License
High performance unsupervised text tokenization for Ruby

Programming Languages

Ruby
36898 projects - #4 most used programming language
C++
36643 projects - #6 most used programming language
Python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to youtokentome-ruby

esapp
An unsupervised Chinese word segmentation tool.
Stars: ✭ 13 (-23.53%)
Mutual labels:  unsupervised-learning, word-segmentation
deep-INFOMAX
Chainer implementation of deep-INFOMAX
Stars: ✭ 32 (+88.24%)
Mutual labels:  unsupervised-learning
LinearCorex
Fast, linear version of CorEx for covariance estimation, dimensionality reduction, and subspace clustering with very under-sampled, high-dimensional data
Stars: ✭ 39 (+129.41%)
Mutual labels:  unsupervised-learning
Indoor-SfMLearner
[ECCV'20] Patch-match and Plane-regularization for Unsupervised Indoor Depth Estimation
Stars: ✭ 115 (+576.47%)
Mutual labels:  unsupervised-learning
DRNET
PyTorch implementation of the NIPS 2017 paper - Unsupervised Learning of Disentangled Representations from Video
Stars: ✭ 45 (+164.71%)
Mutual labels:  unsupervised-learning
KD3A
The official implementation of the model KD3A from the paper "KD3A: Unsupervised Multi-Source Decentralized Domain Adaptation via Knowledge Distillation".
Stars: ✭ 63 (+270.59%)
Mutual labels:  unsupervised-learning
bert tokenization for java
A Java version of the Chinese tokenization described in BERT.
Stars: ✭ 39 (+129.41%)
Mutual labels:  tokenization
treecut
Find nodes in hierarchical clustering that are statistically significant
Stars: ✭ 26 (+52.94%)
Mutual labels:  unsupervised-learning
BaySMM
Model for learning document embeddings along with their uncertainties
Stars: ✭ 25 (+47.06%)
Mutual labels:  unsupervised-learning
SymSpellCppPy
Fast SymSpell written in C++ and exposed to Python via pybind11
Stars: ✭ 28 (+64.71%)
Mutual labels:  word-segmentation
Discovery
Mining Discourse Markers for Unsupervised Sentence Representation Learning
Stars: ✭ 48 (+182.35%)
Mutual labels:  unsupervised-learning
catgan pytorch
Unsupervised and Semi-supervised Learning with Categorical Generative Adversarial Networks
Stars: ✭ 50 (+194.12%)
Mutual labels:  unsupervised-learning
auto-data-tokenize
Identify and tokenize sensitive data automatically using Cloud DLP and Dataflow
Stars: ✭ 21 (+23.53%)
Mutual labels:  tokenization
machine-learning-course
Machine Learning Course @ Santa Clara University
Stars: ✭ 17 (+0%)
Mutual labels:  unsupervised-learning
UETsegmenter
A toolkit for Vietnamese word segmentation
Stars: ✭ 60 (+252.94%)
Mutual labels:  word-segmentation
simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Stars: ✭ 32 (+88.24%)
Mutual labels:  tokenization
customized-symspell
Java port of SymSpell: 1 million times faster through Symmetric Delete spelling correction algorithm
Stars: ✭ 51 (+200%)
Mutual labels:  word-segmentation
NMFADMM
A sparsity aware implementation of "Alternating Direction Method of Multipliers for Non-Negative Matrix Factorization with the Beta-Divergence" (ICASSP 2014).
Stars: ✭ 39 (+129.41%)
Mutual labels:  unsupervised-learning
CAIL2018-toy
The final team project for a data mining course, based on the CAIL-2018 competition. NOTE: this is quite simple and trivial code.
Stars: ✭ 23 (+35.29%)
Mutual labels:  nlp
al-fk-self-supervision
Official PyTorch code for CVPR 2020 paper "Deep Active Learning for Biased Datasets via Fisher Kernel Self-Supervision"
Stars: ✭ 28 (+64.71%)
Mutual labels:  unsupervised-learning

YouTokenToMe Ruby

YouTokenToMe - high performance unsupervised text tokenization - for Ruby

Learn more about how it works


Installation

Add this line to your application’s Gemfile:

gem "youtokentome"
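
Then run bundle install to install the gem.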

Getting Started

Dump your text to a file and enjoy blazingly fast tokenization!
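
For example, you could write a toy corpus like this (the contents of train.txt are purely illustrative):

# write a small training corpus, one sentence per line (contents are illustrative)
File.write("train.txt", "the quick brown fox\njumps over the lazy dog\n")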

Train a model

model = YouTokenToMe::BPE.train(data: "train.txt", model: "model.txt", vocab_size: 30000)

Load a model

model = YouTokenToMe::BPE.new("model.txt")

Get vocab

model.vocab
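
The vocab is the list of learned subwords, so you can inspect it like any Ruby array. A minimal sketch, assuming vocab returns an array of strings (the tokens shown are illustrative):

model.vocab.first(5)  # => e.g. ["<PAD>", "<UNK>", "<BOS>", "<EOS>", "▁the"]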

Encode

model.encode(sentences)

Decode

model.decode(ids)
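
Putting encode and decode together, a round trip looks roughly like this (the token ids shown are illustrative, not actual output):

ids = model.encode(["the quick brown fox"])  # => e.g. [[12, 57, 103, 8]]
model.decode(ids)                            # => ["the quick brown fox"]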

Convert between ids and subwords

model.subword_to_id(subword)
model.id_to_subword(id)
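
For instance, converting a single subword to its id and back (the subword and id here are illustrative, assuming "fox" is in the trained vocabulary):

id = model.subword_to_id("fox")  # => e.g. 8
model.id_to_subword(id)          # => "fox"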

Options

Train

YouTokenToMe::BPE.train(
  data: "train.txt",   # path to file with training data
  model: "model.txt",  # path to where the trained model will be saved
  vocab_size: 30000,   # number of tokens in the final vocabulary
  coverage: 1.0,       # fraction of characters covered by the model
  n_threads: -1,       # number of parallel threads used to run
  pad_id: 0,           # reserved id for padding
  unk_id: 1,           # reserved id for unknown symbols
  bos_id: 2,           # reserved id for beginning-of-sentence token
  eos_id: 3            # reserved id for end-of-sentence token
)
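
Assuming the values shown above are the defaults, you only need to override the options you care about. A sketch with illustrative file names and values:

model = YouTokenToMe::BPE.train(
  data: "train.txt",
  model: "model.txt",
  vocab_size: 5000,    # smaller vocabulary for a small corpus
  n_threads: 2         # cap parallelism instead of using all available cores
)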

Encode

model.encode(
  sentences,
  output_type: :id,    # or :subword
  bos: false,          # add "beginning of sentence" token
  eos: false,          # add "end of sentence" token
  reverse: false,      # reverse output sequence of tokens
  dropout_prob: 0.0    # BPE-dropout probability
)
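
For example, to get subword strings instead of ids, with sentence boundary tokens added (output is illustrative; YouTokenToMe marks word starts with the ▁ character):

model.encode(["the fox"], output_type: :subword, bos: true, eos: true)
# => e.g. [["<BOS>", "▁the", "▁fox", "<EOS>"]]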

History

View the changelog

Contributing

Everyone is encouraged to help improve this project. Here are a few ways you can help:

- Report bugs
- Fix bugs and submit pull requests
- Write, clarify, or fix documentation
- Suggest or add new features

To get started with development:

git clone https://github.com/ankane/youtokentome-ruby.git
cd youtokentome-ruby
bundle install
bundle exec rake compile
bundle exec rake test