csstools / tokenizer

License: CC0-1.0
Tokenize CSS according to the CSS Syntax

CSS Tokenizer

This tool lets you tokenize CSS according to the CSS Syntax Specification. Tokenizing CSS means separating a string of CSS into its smallest semantic parts, otherwise known as tokens.

This tool is intended to be used in other tools on the front and back end. It seeks to maintain:

  • 100% compliance with the CSS syntax specification.
  • 100% code coverage. 🦺
  • 100% static typing. 💪
  • 1kB maximum contribution size. 📦
  • Superior quality over Shark P. 🦈

Usage

Add the CSS tokenizer to your project:

npm install @csstools/tokenizer

Tokenize CSS in JavaScript:

import { tokenize } from '@csstools/tokenizer'

for (const token of tokenize(cssText)) {
  console.log(token) // logs an individual CSSToken
}

Tokenize CSS in classical NodeJS:

const { tokenizer } = require('@csstools/tokenizer')

let iterator = tokenizer(cssText), iteration

while (!(iteration = iterator()).done) {
  console.log(iteration.value) // logs an individual CSSToken
}
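
If a for...of loop is preferred in this style as well, the function-based iterator can be wrapped in a generator. This is a minimal sketch that assumes only the behavior shown above (a function that returns objects with done and value). toIterable is a hypothetical helper, not part of the package, and the stand-in next function exists only so the sketch runs without the package installed.

```javascript
// Hypothetical helper: adapts a function returning { done, value }
// into something usable with for...of.
function* toIterable(next) {
  let iteration
  while (!(iteration = next()).done) {
    yield iteration.value
  }
}

// Stand-in for tokenizer(cssText), so the example is self-contained.
const values = ['first', 'second']
let index = 0
const next = () =>
  index < values.length
    ? { done: false, value: values[index++] }
    : { done: true, value: undefined }

for (const value of toIterable(next)) {
  console.log(value) // logs each value produced by next()
}
```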

Tokenize CSS in client-side scripts:

<script type="module">

import { tokenize } from 'https://unpkg.com/@csstools/tokenizer?module'

for (const token of tokenize(cssText)) {
  console.log(token) // logs an individual CSSToken
}

</script>

Tokenize CSS in classical client-side scripts:

<script src="http://unpkg.com/@csstools/tokenizer"></script>
<script>

const tokens = Array.from(tokenizeCSS(cssText)) // an array of CSSTokens

</script>

How it works

The CSS tokenizer separates a string of CSS into tokens.

interface CSSToken {
  /** Position in the string at which the token was retrieved. */
  tick: number

  /** Number identifying the kind of token. */
  type:
    | 1 // Symbol
    | 2 // Comment
    | 3 // Space
    | 4 // Word
    | 5 // Function
    | 6 // Atword
    | 7 // Hash
    | 8 // String
    | 9 // Number

  /** Code, like the character code of a symbol, or the character code of the opening parenthesis of a function. */
  code: number

  /** Lead, like the opening of a comment, the quotation mark of a string, or the name of a function. */
  lead: string

  /** Data, like the numbers before a unit, the word after an at-sign, or the opening parenthesis of a Function. */
  data: string

  /** Tail, like the unit after a number, or the closing of a comment. */
  tail: string
}
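
The numeric type values can be mapped back to the names given in the interface comments, which helps when logging tokens. A minimal sketch follows; TOKEN_TYPE_NAMES and tokenTypeName are hypothetical helpers written for illustration and are not part of the package.

```javascript
// Hypothetical lookup table, copied from the comments in the CSSToken interface.
const TOKEN_TYPE_NAMES = {
  1: 'Symbol',
  2: 'Comment',
  3: 'Space',
  4: 'Word',
  5: 'Function',
  6: 'Atword',
  7: 'Hash',
  8: 'String',
  9: 'Number',
}

// Returns the readable name for a token, or 'Unknown' for an unexpected type.
function tokenTypeName(token) {
  return TOKEN_TYPE_NAMES[token.type] ?? 'Unknown'
}

console.log(tokenTypeName({ type: 6 })) // "Atword"
```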

As an example, the CSS string @media would become an Atword token where @ and media are recognized as distinct parts of that token. As another example, the CSS string 5px would become a Number token where 5 and px are recognized as distinct parts of that token. As a final example, the string 5px 10px would become 3 tokens: the Number mentioned before (5px), a Space token that represents a single space ( ), and then another Number token (10px).
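
The 5px 10px example can be written out as plain objects to make the token parts concrete. This is a sketch based on the field descriptions in the interface, not captured tokenizer output: the code field is omitted, the tick offsets, the empty lead values, and the single-space data value are assumptions, and sourceText is a hypothetical helper that assumes lead, data, and tail concatenate back to the original text.

```javascript
// Hand-written tokens for the CSS string "5px 10px" (illustrative, not real output).
// tick is assumed here to be the offset at which each token starts.
const tokens = [
  { tick: 0, type: 9, lead: '', data: '5', tail: 'px' },  // Number
  { tick: 3, type: 3, lead: '', data: ' ', tail: '' },    // Space
  { tick: 4, type: 9, lead: '', data: '10', tail: 'px' }, // Number
]

// Hypothetical helper: rebuilds source text on the assumption that
// lead + data + tail covers every character of the token.
const sourceText = (token) => token.lead + token.data + token.tail

console.log(tokens.map(sourceText).join('')) // "5px 10px"
```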

Benchmarks

As of August 23, 2021, these benchmarks were averaged from my local machine:

Benchmark: Tailwind CSS
  ┌────────────────────────────────────────────────────┬───────┬────────┬────────┐
  │                      (index)                       │  ms   │ ms/50k │ tokens │
  ├────────────────────────────────────────────────────┼───────┼────────┼────────┤
  │ CSSTree 1 x 35.04 ops/sec ±6.55% (64 runs sampled) │ 28.54 │  1.51  │ 946205 │
  │ CSSTree 2 x 41.76 ops/sec ±7.57% (58 runs sampled) │ 23.95 │  1.27  │ 946205 │
  │ PostCSS 8 x 14.18 ops/sec ±3.31% (40 runs sampled) │ 70.54 │  3.77  │ 935282 │
  │ Tokenizer x 17.40 ops/sec ±0.98% (48 runs sampled) │ 57.48 │  3.04  │ 946206 │
  └────────────────────────────────────────────────────┴───────┴────────┴────────┘

Benchmark: Bootstrap
  ┌───────────────────────────────────────────────────┬──────┬────────┬────────┐
  │                      (index)                      │  ms  │ ms/50k │ tokens │
  ├───────────────────────────────────────────────────┼──────┼────────┼────────┤
  │ CSSTree 1 x 600 ops/sec ±0.87% (96 runs sampled)  │ 1.67 │  1.41  │ 59236  │
  │ CSSTree 2 x 695 ops/sec ±0.08% (100 runs sampled) │ 1.44 │  1.21  │ 59236  │
  │ PostCSS 8 x 432 ops/sec ±0.94% (94 runs sampled)  │ 2.31 │  2.26  │ 51170  │
  │ Tokenizer x 288 ops/sec ±0.40% (93 runs sampled)  │ 3.48 │  2.93  │ 59237  │
  └───────────────────────────────────────────────────┴──────┴────────┴────────┘
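
The ms and ms/50k columns are not defined in the tables. The figures appear consistent with ms being the mean time per run (1000 divided by ops/sec) and ms/50k being that time scaled to 50,000 tokens, which makes inputs of different sizes comparable. A sketch of that arithmetic, under that assumption and with hypothetical function names:

```javascript
// Assumed: mean milliseconds per run, derived from operations per second.
const msPerRun = (opsPerSec) => 1000 / opsPerSec

// Assumed: time per run scaled to a fixed 50,000 tokens.
const msPer50k = (ms, tokens) => (ms / tokens) * 50000

// Tailwind CSS, CSSTree 1 row: 35.04 ops/sec.
console.log(msPerRun(35.04).toFixed(2)) // "28.54"

// Tailwind CSS, Tokenizer row: 57.48 ms over 946206 tokens.
console.log(msPer50k(57.48, 946206).toFixed(2)) // "3.04"
```

Because the published values are rounded, recomputing from the table can differ from the printed figure in the last digit for some rows.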

Development

You wanna take a deeper dive? Awesome! Here are a few useful development commands.

npm run build

The build command creates all the files needed to run this tool in many different JavaScript environments.

npm run build

npm run benchmark

The benchmark command builds the project and then tests its performance as compared to PostCSS. These benchmarks are run against Bootstrap and Tailwind CSS.

npm run benchmark

npm run test

The test command tests the coverage and accuracy of the tokenizer.

As of September 26, 2020, this tokenizer has 100% test coverage:

npm run test