foonathan/lex

Licence: BSL-1.0

Note: Replaced by foonathan/lexy.

This library is a C++14 constexpr tokenization and (in the future) parsing library. The tokens are specified in the type system, so they are available at compile time. From this information, a trie is constructed that matches the input efficiently.

Basic Example

The tokens for a simple calculator:

using tokens = lex::token_spec<struct variable, struct plus, struct minus>;

struct variable : lex::rule_token<variable, tokens>
{
    static constexpr auto rule() noexcept
    {
        // a variable consists of one or more alphabetic characters
        return lex::token_rule::plus(lex::ascii::is_alpha);
    }
};

struct plus : lex::literal_token<'+'>
{};

struct minus : lex::literal_token<'-'>
{};

See example/ctokenizer.cpp for an annotated example and tutorial.

Features

  • Declarative token specification: No need to worry about ordering or implementing lexing by hand.
  • Fast: Performance is comparable to or faster than a handwritten state machine, see benchmarks.
  • Lightweight: No memory allocation, tokens are just string views into the input.
  • Lazy: The lex::tokenizer will just tokenize the next token in the input.
  • Fully constexpr: The entire lexing can happen at compile-time.
  • Flexible error handling: On invalid input, a lex::error_token is created that consumes a single character. The parser can then decide how the error should be handled.

FAQ

Q: Isn't the name lex already taken?

A: It is. That's why the library is called foonathan/lex. In my defense, naming is hard. I could come up with some cute name, but then it's not really descriptive. If you know foonathan/lex, you know what the project is about.

Q: Sounds great, but what about compile-time?

A: Compiling the foonathan_lex_ctokenizer target, which contains an implementation of a tokenizer for C (modulo some details), takes under three seconds. Just including <iostream> takes about half a second, including <iostream> and <regex> takes about two seconds. So the compile time is noticeable, but since a tokenizer is used in only a few files of a project and rarely changes, it is acceptable.

In the future, I will probably look at optimizing it as well.

Q: My lex::rule_token doesn't seem to be matched?

A: This could be due to one of two things:

  • Multiple rule tokens would match the input. Then the tokenizer just picks the one that comes first. Make sure that all rule tokens are mutually exclusive, perhaps by using lex::null_token and creating them all in one place as necessary. See int_literal and float_literal in the C tokenizer for an example.
  • A literal token is a prefix of the rule token, e.g. a C comment /* … */ and the / operator are in conflict. By default, the literal token is preferred in that case. Implement is_conflicting_literal() in your rule token as done by the comment token in the C tokenizer.

A mode to test for these issues is planned.

Q: The lex::tokenizer gives me just the next token, how do I implement lookahead for specific tokens?

A: Simply call get() until you've reached the token you want to look ahead to, then reset() the tokenizer to the earlier position.

Q: How does it compare to compile-time-regular-expressions?

A: That project implements a RegEx parser at compile-time, which can be used to match strings. foonathan/lex is purely designed to tokenize strings. You could implement a tokenizer with the compile-time RegEx, but I have chosen a different approach.

Q: How does it compare to PEGTL?

A: That project implements matching of parsing expression grammars (PEGs), which are basically a more powerful form of regular expressions. On top of that, they've implemented a parsing interface, so you can create a parse tree, for example. foonathan/lex currently does just tokenization, but I plan on adding parse rules on top of the tokens later on. Complex tokens in foonathan/lex can be described using PEGs as well, but here the PEGs are described using operator overloading and functions, whereas in PEGTL they are described by the type system.

Q: It breaks when I do this!

A: Don't do that. And file an issue (or a PR, I have a lot of other projects...).

Q: This is awesome!

A: Thanks. I do have a Patreon page, so consider checking it out:

Patreon

Documentation

Tutorial and reference documentation can be found here.

Compiler Support

The library requires a C++14 compiler with reasonable constexpr support. Compilers that are being tested on CI:

  • Linux:
    • GCC 5 to 8, but compile-time parsing is not supported for GCC < 8 (still works at runtime)
    • clang 4 to 7
  • MacOS:
    • XCode 9 and 10
  • Windows:
    • Visual Studio 2017, but compile-time parsing sometimes doesn't work (still works at runtime)

Installation

The library is header-only and requires my debug_assert library as well as the (header-only and standalone) Boost.mp11.

Using CMake add_subdirectory():

Download and call add_subdirectory(). It will look for the dependencies with find_package(); if they're not found, the git submodules will be used.

Then link to foonathan::foonathan_lex.

Using CMake find_package():

Download and install, setting the CMake variable FOONATHAN_LEX_FORCE_FIND_PACKAGE=ON. This requires the dependencies to be installed as well.

Then call find_package(foonathan_lex) and link to foonathan::foonathan_lex.

With other buildsystems:

You need to set the following options:

  • Enable C++14
  • Add the include path, so #include <debug_assert.hpp> works
  • Add the include path, so #include <boost/mp11/mp11.hpp> works
  • Add the include path, so #include <foonathan/lex/tokenizer.hpp> works

Planned Features

  • Parser on top of the tokenizer
  • Integrated way to handle data associated with tokens (like the value of an integer literal)
  • Optimize compile-time