foonathan/lex

Licence: BSL-1.0

Note: Replaced by foonathan/lexy.

This library is a C++14 constexpr tokenization and (in the future) parsing library. The tokens are specified in the type system, so they are available at compile time. From this information, a trie is constructed that matches the input efficiently.

Basic Example

The tokens for a simple calculator:

using tokens = lex::token_spec<struct variable, struct plus, struct minus>;

struct variable : lex::rule_token<variable, tokens>
{
    static constexpr auto rule() noexcept
    {
        // a variable consists of one or more alphabetic characters
        return lex::token_rule::plus(lex::ascii::is_alpha);
    }
};

struct plus : lex::literal_token<'+'>
{};

struct minus : lex::literal_token<'-'>
{};

See example/ctokenizer.cpp for an annotated example and tutorial.

Features

  • Declarative token specification: No need to worry about ordering or implementing lexing by hand.
  • Fast: Performance is comparable to or faster than a handwritten state machine, see benchmarks.
  • Lightweight: No memory allocation, tokens are just string views into the input.
  • Lazy: The lex::tokenizer will just tokenize the next token in the input.
  • Fully constexpr: The entire lexing can happen at compile-time.
  • Flexible error handling: On invalid input, a lex::error_token is created that consumes a single character. The parser can then decide how the error should be handled.

FAQ

Q: Isn't the name lex already taken?

A: It is. That's why the library is called foonathan/lex. In my defense, naming is hard. I could come up with some cute name, but then it's not really descriptive. If you know foonathan/lex, you know what the project is about.

Q: Sounds great, but what about compile-time?

A: Compiling the foonathan_lex_ctokenizer target, which contains an implementation of a tokenizer for C (modulo some details), takes under three seconds. Just including <iostream> takes about half a second, including <iostream> and <regex> takes about two seconds. So the compile time is noticeable, but since a tokenizer is used in only a few files of a project and rarely changes, it is acceptable.

In the future, I will probably look at optimizing it as well.

Q: My lex::rule_token doesn't seem to be matched?

A: This could be due to one of two things:

  • Multiple rule tokens would match the input. Then the tokenizer just picks the one that comes first. Make sure that all rule tokens are mutually exclusive, perhaps by using lex::null_token and creating them all in one place as necessary. See int_literal and float_literal in the C tokenizer for an example.
  • A literal token is a prefix of the rule token, e.g. a C comment /* … */ and the / operator are in conflict. By default, the literal token is preferred in that case. Implement is_conflicting_literal() in your rule token as done by the comment token in the C tokenizer.

A mode to test for these issues is planned.

Q: The lex::tokenizer gives me just the next token, how do I implement lookahead for specific tokens?

A: Simply call get() until you've reached the token you want to look ahead to, then reset() the tokenizer to the earlier position.

Q: How does it compare to compile-time-regular-expressions?

A: That project implements a RegEx parser at compile-time, which can be used to match strings. foonathan/lex is purely designed to tokenize strings. You could implement a tokenizer with the compile-time RegEx, but I have chosen a different approach.

Q: How does it compare to PEGTL?

A: That project implements matching of parsing expression grammars (PEGs), which are basically a more powerful form of regular expressions. On top of that, they've implemented a parsing interface, so you can create a parse tree, for example. foonathan/lex currently does just tokenization, but I plan on adding parse rules on top of the tokens later on. Complex tokens in foonathan/lex can be described using PEGs as well, but here the PEGs are described using operator overloading and functions, whereas in PEGTL they are described by the type system.

Q: It breaks when I do this!

A: Don't do that. And file an issue (or a PR, I have a lot of other projects...).

Q: This is awesome!

A: Thanks. I do have a Patreon page, so consider checking it out:

Patreon

Documentation

Tutorial and reference documentation can be found here.

Compiler Support

The library requires a C++14 compiler with reasonable constexpr support. Compilers that are being tested on CI:

  • Linux:
    • GCC 5 to 8, but compile-time parsing is not supported for GCC < 8 (still works at runtime)
    • clang 4 to 7
  • MacOS:
    • XCode 9 and 10
  • Windows:
    • Visual Studio 2017, but compile-time parsing sometimes doesn't work (still works at runtime)

Installation

The library is header-only and requires my debug_assert library as well as the (header-only and standalone) Boost.mp11.

Using CMake add_subdirectory():

Download and call add_subdirectory(). It will look for the dependencies with find_package(); if they're not found, the git submodules will be used.

Then link to foonathan::foonathan_lex.

Using CMake find_package():

Download and install, setting the CMake variable FOONATHAN_LEX_FORCE_FIND_PACKAGE=ON. This requires the dependencies to be installed as well.

Then call find_package(foonathan_lex) and link to foonathan::foonathan_lex.

With other buildsystems:

You need to set the following options:

  • Enable C++14
  • Add the include path, so #include <debug_assert.hpp> works
  • Add the include path, so #include <boost/mp11/mp11.hpp> works
  • Add the include path, so #include <foonathan/lex/tokenizer.hpp> works

Planned Features

  • Parser on top of the tokenizer
  • Integrated way to handle data associated with tokens (like the value of an integer literal)
  • Optimize compile-time