mathewsanders / Mustard
Projects that are alternatives to or similar to Mustard

PaddleTokenizer
A Chinese word-segmentation engine based on deep neural networks, implemented with PaddlePaddle.
Stars: ✭ 14 (-97.97%)
Mutual labels:  tokenizer
Friso
A high-performance Chinese tokenizer supporting both GBK and UTF-8 charsets, based on the MMSEG algorithm and written in ANSI C. Its fully modular implementation makes it easy to embed in other programs, such as MySQL, PostgreSQL, and PHP.
Stars: ✭ 313 (-54.57%)
Mutual labels:  tokenizer
Smoothnlp
An NLP toolset with a focus on explainable inference.
Stars: ✭ 435 (-36.87%)
Mutual labels:  tokenizer
Hebrew-Tokenizer
A very simple Python tokenizer for Hebrew text.
Stars: ✭ 16 (-97.68%)
Mutual labels:  tokenizer
Sacremoses
Python port of Moses tokenizer, truecaser and normalizer
Stars: ✭ 293 (-57.47%)
Mutual labels:  tokenizer
Jflex
The fast scanner generator for Java™ with full Unicode support
Stars: ✭ 380 (-44.85%)
Mutual labels:  tokenizer
bredon
A modern CSS value compiler in JavaScript
Stars: ✭ 39 (-94.34%)
Mutual labels:  tokenizer
Kagome
Self-contained Japanese Morphological Analyzer written in pure Go
Stars: ✭ 554 (-19.59%)
Mutual labels:  tokenizer
Sentences
A multilingual command line sentence tokenizer in Golang
Stars: ✭ 293 (-57.47%)
Mutual labels:  tokenizer
Ekphrasis
Ekphrasis is a text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, using word statistics from 2 big corpora (english Wikipedia, twitter - 330mil english tweets).
Stars: ✭ 433 (-37.16%)
Mutual labels:  tokenizer
ArabicProcessingCog
A Python package that does stemming, tokenization, sentence breaking, segmentation, normalization, and POS tagging for Arabic.
Stars: ✭ 19 (-97.24%)
Mutual labels:  tokenizer
Jumanpp
Juman++ (a Morphological Analyzer Toolkit)
Stars: ✭ 254 (-63.13%)
Mutual labels:  tokenizer
Php Parser
🌿 NodeJS PHP Parser - extract AST or tokens (PHP5 and PHP7)
Stars: ✭ 400 (-41.94%)
Mutual labels:  tokenizer
cang-jie
Chinese tokenizer for tantivy, based on jieba-rs
Stars: ✭ 48 (-93.03%)
Mutual labels:  tokenizer
Open Korean Text
Open Korean Text Processor - An Open-source Korean Text Processor
Stars: ✭ 438 (-36.43%)
Mutual labels:  tokenizer
text2text
Text2Text: Cross-lingual natural language processing and generation toolkit
Stars: ✭ 188 (-72.71%)
Mutual labels:  tokenizer
Lexmachine
Lex machinery for Go.
Stars: ✭ 335 (-51.38%)
Mutual labels:  tokenizer
Soynlp
A Python library for Korean natural language processing, providing word extraction, tokenization, part-of-speech tagging, and preprocessing.
Stars: ✭ 613 (-11.03%)
Mutual labels:  tokenizer
Tokenizer
A small library for converting tokenized PHP source code into XML (and potentially other formats)
Stars: ✭ 4,770 (+592.31%)
Mutual labels:  tokenizer
Moo
Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.
Stars: ✭ 434 (-37.01%)
Mutual labels:  tokenizer

Mustard 🌭

GitHub license Carthage compatible Swift Package Manager compatible

Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it.

Quick start using character sets

Foundation includes the String method components(separatedBy:), which splits a string into substrings at the given separator characters:

let sentence = "hello 2017 year"
let words = sentence.components(separatedBy: .whitespaces)
// words.count -> 3
// words = ["hello", "2017", "year"]

Mustard provides a similar feature but takes the opposite approach: instead of matching separators, you match against one or more character sets. This is useful when separators simply don't exist:

import Mustard

let sentence = "hello2017year"
let words = sentence.components(matchedWith: .letters, .decimalDigits)
// words.count -> 3
// words = ["hello", "2017", "year"]
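Conceptually, matching by character sets amounts to walking the string scalar by scalar, grouping consecutive scalars that belong to the same set and discarding scalars that belong to none. The Foundation-only sketch below illustrates that idea; the components(of:matchedWith:) helper is invented here for illustration and is not Mustard's implementation:

```swift
import Foundation

// Rough approximation of matching by character sets: group runs of
// scalars that belong to the same set, dropping unmatched scalars.
func components(of string: String, matchedWith sets: [CharacterSet]) -> [String] {
    var results: [String] = []
    var current = ""
    var currentSet: CharacterSet?
    for scalar in string.unicodeScalars {
        if let set = currentSet, set.contains(scalar) {
            // Current run continues.
            current.append(Character(scalar))
        } else {
            // Run ended: emit it, then start a new run if any set matches.
            if !current.isEmpty { results.append(current) }
            currentSet = sets.first { $0.contains(scalar) }
            current = currentSet != nil ? String(Character(scalar)) : ""
        }
    }
    if !current.isEmpty { results.append(current) }
    return results
}

print(components(of: "hello2017year", matchedWith: [.letters, .decimalDigits]))
// ["hello", "2017", "year"]
```

Mustard's real tokenizers are more capable than this sketch, but the grouping behavior shown here matches the example above.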

If you want more than just the substrings, you can use the tokens(matchedWith: CharacterSet...) method, which returns an array of TokenType values.

At a minimum, TokenType requires properties for text (the substring matched) and range (the range of the substring in the original string). When using a CharacterSet as a tokenizer, the more specific type CharacterSetToken is returned, which includes a set property containing the CharacterSet instance that produced the match.

import Mustard

let tokens = "123Hello world&^45.67".tokens(matchedWith: .decimalDigits, .letters)
// tokens: [CharacterSet.Token]
// tokens.count -> 5 (characters '&', '^', and '.' are ignored)
//
// second token..
// tokens[1].text -> "Hello"
// tokens[1].range -> Range<String.Index>(3..<8)
// tokens[1].set -> CharacterSet.letters
//
// last token..
// tokens[4].text -> "67"
// tokens[4].range -> Range<String.Index>(19..<21)
// tokens[4].set -> CharacterSet.decimalDigits

Advanced matching with custom tokenizers

Mustard can do more than match from character sets. You can create your own tokenizers with more sophisticated matching behavior by implementing the TokenizerType and TokenType protocols.
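The exact requirements of TokenizerType are covered in Mustard's documentation; as a loose, self-contained sketch of the underlying idea, a tokenizer decides scalar by scalar whether the current match can continue. The ScalarTokenizer protocol, HexTokenizer type, and matches(in:using:) driver below are invented for illustration and are not Mustard's API:

```swift
import Foundation

// Toy stand-in for a tokenizer protocol (illustrative only).
protocol ScalarTokenizer {
    // Decides, scalar by scalar, whether the current match can continue.
    func canTake(_ scalar: UnicodeScalar) -> Bool
}

// A tokenizer that matches runs of hexadecimal digits.
struct HexTokenizer: ScalarTokenizer {
    private let hexDigits = CharacterSet(charactersIn: "0123456789abcdefABCDEF")
    func canTake(_ scalar: UnicodeScalar) -> Bool {
        return hexDigits.contains(scalar)
    }
}

// Minimal driver: collect the maximal runs the tokenizer accepts.
func matches(in text: String, using tokenizer: ScalarTokenizer) -> [String] {
    var results: [String] = []
    var current = ""
    for scalar in text.unicodeScalars {
        if tokenizer.canTake(scalar) {
            current.append(Character(scalar))
        } else if !current.isEmpty {
            results.append(current)
            current = ""
        }
    }
    if !current.isEmpty { results.append(current) }
    return results
}

print(matches(in: "#FF00AA / #123456", using: HexTokenizer()))
// ["FF00AA", "123456"]
```

A real Mustard tokenizer can also carry state (as DateTokenizer does below) so that a candidate match can be validated or rejected as a whole.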

Here's an example of using DateTokenizer (see example for implementation), which finds substrings matching an MM/dd/yy format.

DateTokenizer returns tokens with the type DateToken. Along with the substring text and range, DateToken includes a Date object corresponding to the date in the substring:

import Mustard

let text = "Serial: #YF 1942-b 12/01/17 (Scanned) 12/03/17 (Arrived) ref: 99/99/99"

let tokens = text.tokens(matchedWith: DateTokenizer())
// tokens: [DateTokenizer.Token]
// tokens.count -> 2
// ('99/99/99' is *not* matched by `DateTokenizer` because it's not a valid date)
//
// first date
// tokens[0].text -> "12/01/17"
// tokens[0].date -> Date(2017-12-01 05:00:00 +0000)
//
// last date
// tokens[1].text -> "12/03/17"
// tokens[1].date -> Date(2017-12-03 05:00:00 +0000)
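To illustrate why "99/99/99" is rejected even though it fits the digit pattern, the snippet below uses a non-lenient Foundation DateFormatter, which matches the MM/dd/yy shape but refuses digit groups that don't form a real calendar date. This is a standalone illustration of the validation step, not necessarily how DateTokenizer is implemented:

```swift
import Foundation

let formatter = DateFormatter()
formatter.dateFormat = "MM/dd/yy"
// Non-lenient parsing requires the components to be a valid date,
// not merely the right shape.
formatter.isLenient = false
formatter.timeZone = TimeZone(identifier: "UTC")

print(formatter.date(from: "12/01/17") != nil) // true: a valid date
print(formatter.date(from: "99/99/99") != nil) // false: month 99 doesn't exist
```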

Documentation & Examples

Roadmap

  • [x] Include detailed examples and documentation
  • [x] Ability to skip/ignore characters within match
  • [x] Include more advanced pattern matching for matching tokens
  • [x] Make project logo 🌭
  • [x] Performance testing / benchmarking against Scanner
  • [ ] Include interface for working with Character tokenizers

Requirements

  • Swift 4.1

Author

Made with ❤️ by @permakittens

Contributing

Feedback, or contributions for bug fixing or improvements are welcome. Feel free to submit a pull request or open an issue.

License

MIT
