All Projects → gosukiwi → Pasukon

gosukiwi / Pasukon

Licence: mit
JavaScript practical parser generator library using combinators

Programming Languages

javascript
184084 projects - #8 most used programming language

Projects that are alternatives of or similar to Pasukon

leftry
Leftry - A left-recursion enabled recursive-descent parser combinator library for Lua.
Stars: ✭ 32 (-70.09%)
Mutual labels:  parser-combinators, parser-generator
Lug
Parsing expression grammar (PEG) embedded domain specific language and parsing machine for C++17
Stars: ✭ 44 (-58.88%)
Mutual labels:  parser-combinators, parser-generator
Parsley
An exceptionally fast parser combinator library for Scala
Stars: ✭ 31 (-71.03%)
Mutual labels:  parser-combinators
Pegtl
Parsing Expression Grammar Template Library
Stars: ✭ 1,295 (+1110.28%)
Mutual labels:  parser-combinators
Parser Combinators From Scratch
Code that accompanies the series
Stars: ✭ 56 (-47.66%)
Mutual labels:  parser-combinators
Parseque
Total Parser Combinators in Coq
Stars: ✭ 37 (-65.42%)
Mutual labels:  parser-combinators
Combine Language
A crate which defines parsers for common programming language constructs using https://github.com/Marwes/combine
Stars: ✭ 71 (-33.64%)
Mutual labels:  parser-combinators
Parsel
Create complex parsers by combining simple ones with Parsel!
Stars: ✭ 21 (-80.37%)
Mutual labels:  parser-combinators
Parsec.el
A parser combinator library for Emacs Lisp, similar to Haskell's Parsec library.
Stars: ✭ 98 (-8.41%)
Mutual labels:  parser-combinators
Footlessparser
A simple parser combinator written in Swift
Stars: ✭ 60 (-43.93%)
Mutual labels:  parser-combinators
Ts Pegjs
Plugin for pegjs to generate TypeScript parsers.
Stars: ✭ 76 (-28.97%)
Mutual labels:  parser-generator
Funcj
Assorted functional-oriented data structures and algorithms for Java.
Stars: ✭ 60 (-43.93%)
Mutual labels:  parser-combinators
Parsing With Haskell Parser Combinators
🔍 A step-by-step guide to parsing using Haskell parser combinators.
Stars: ✭ 72 (-32.71%)
Mutual labels:  parser-combinators
Mini Haskell
A self-hosting mini Haskell compiler with a mini C runtime.
Stars: ✭ 37 (-65.42%)
Mutual labels:  parser-combinators
Nice Parser
Nice parsers in OCaml without the boilerplate
Stars: ✭ 91 (-14.95%)
Mutual labels:  parser-generator
Comby
A tool for structural code search and replace that supports ~every language.
Stars: ✭ 912 (+752.34%)
Mutual labels:  parser-combinators
Feel Scala
FEEL parser and interpreter written in Scala
Stars: ✭ 52 (-51.4%)
Mutual labels:  parser-combinators
Myna Parser
Myna Parsing Library
Stars: ✭ 69 (-35.51%)
Mutual labels:  parser-combinators
Opal
Self-contained monadic parser combinators for OCaml
Stars: ✭ 105 (-1.87%)
Mutual labels:  parser-combinators
Fall
Stars: ✭ 92 (-14.02%)
Mutual labels:  parser-generator

Pasukon

Pasukon generates parsers using an easy to learn grammar. It's based on parser combinators, and also implements a lexing step.

It is highly extensible (you can make your own lexer and combinators), has no external dependencies, and works in both Node.js and Browser.

Try it online!

By far the easiest way to get started is using Pasukon's online editor. Check it out!

Install

npm install -g pasukon

Usage

You can use Pasukon in several ways. The simplest one is to give it a grammar as a string:

const Pasukon = require('pasukon').Pasukon
const parser = new Pasukon(fs.readFileSync('my-grammar.pasukon').toString())
parser.parse('hello, world!')

To get the most out of Pasukon, though, consider compiling the grammar to enable several optimizations. You can compile the grammar using the online editor, or with the CLI:

pasukon my-grammar.pasukon grammar.js

You can then load it as usual:

const grammar = require('grammar')
const Pasukon = require('pasukon').Pasukon
const parser = new Pasukon(grammar)
parser.parse('hello, world!')

Browser Usage

If you are just using a browser, you can use the distributable build:

<script type="text/javascript" src="pasukon.dist.js"></script>
<script type="text/javascript" src="grammar.js"></script>
<script type="text/javascript">
  // `Pasukon` and `grammar` are globals defined in `pasukon.dist.js` and
  // `grammar.js` respectively
  const pasukon = new Pasukon(grammar)
  pasukon.parse('my input')
</script>

For anything but trivial usage, it's recommended to use Pasukon from NPM with a module system. The most popular options at the moment are Webpack and Browserify.

Options

You can optionally pass Pasukon an options object:

  • lexer: An instance of a Lexer to be used by the parser. Default: undefined. If none specified, it will use the built-in lexer in the grammar.
  • cache: Boolean. Default: false. When true, it will cache the parsers results.
  • start: String. Default: null. When given, it will start the parsing process from the specified rule.
  • debug: Boolean. Default: false. When true, it will use the logger to log each parsing step.
  • logger: Default: null. If specified, it will use this logger when debugging. A logger only needs a log(string) method.
  • combinators: Default: {}. If specified, it will include the custom combinators given. For more info, see Adding Your Own Combinators.
  • context: Default: {}. Sets the evaluation context. For more info, see Setting The Evaluation Context.
const result = new Pasukon(grammar, { cache: true, start: 'some-rule' }).parse('input')

Syntax

Pasukon syntax is rather simple. It's divided in two parts: Lexing and Parsing.

Lexing

Optionally, you can define a lexer inside a lex-/lex block. In the format [match|ignore] <token-name> <matcher>.

lex
  match  DEF        'def'
  match  extend     "can also use double quotes"
  match  IDENTIFIER /^[a-zA-Z]/
  match  NEWLINE    {^\n/}
  ignore WHITESPACE /^[ \t\r]+/
/lex

It supports two different matchers, the first one is just a string comparison. The second uses regex, and matches the input against it. It returns true only if the input starts with the given regex, that is, it has matched and the index is 0.

For performance reason, it's a good idea to start all regexes with ^ so the regex engine can stop as soon as possible.

For more complex lexers, you can build your own. All it needs is an object that implements *lex(input) and returns an array of Token.

Notice that *lex is a generator function that uses yield to return tokens in a lazy way.

The Token class lives in lib/lexer/token. It implements an is method (eg: token.is('NEWLINE')), a toString method -- used for error reporting, as well as col and line properties. You can use your own Token implementation or the one provided by Pasukon.

TODO: For more info on how to write a custom lexer, see the Wiki.

Parsing

For an in-depth documentation, check out the Wiki.

The parsing part of the grammar is simply a set or rules:

<rule-name>
  | <rule-body>
  | <rule-body>
  ;

You can use the built-in token parser to match a single token:

program
  | token a
  ;

That will make a new parser, called program, that will call the token built-in parser with the argument a. Now the program parser will match a single instance of the token a.

The token parser is special, because it uses tokens as parameters, instead of other parsers.

There are three ways of calling parsers. Unary Call:

program
  | many0 (token a)
  ;

Binary Call:

program
  | (token a) or (token b)
  ;

And Rule Call:

start
  | another-rule
  ;

A rule can be composed of any combination of those methods:

program
  | many0 ((token a) or (token b))
  | (token a) or ((token b) or (token c))
  ;

Notice that no matter how nested it is, an outmost rule is always either in unary, binary, or rule call format.

Another important thing to note is the use parentheses. Binary combinators all have the same precedence, so they are always executed from left to right:

program
  | (token a) or (token b) or (token c)
  // is the same as
  | (token a) or ((token b) or (token c))
  ;

If you need to change the priority, simply use parentheses. It's recommended to always use them in non-trivial rules to make things clear.

Syntactic Sugar 🍬

The grammar syntax provides some sugar 🍬 to make things easier. The most common is |, which is sugar for the or combinator:

program
  | (token a) or (token b)
  ;
// is the same as
program
  | token a
  | token b
  ;

All built-in unary parsers have a shortcut syntax. You can use : to call the token parser:

program
  | :a
  | :b
  ;

The shortcut syntax has high precedence, so it can go anywhere and there's no need for parentheses:

program
  | opt :a // this is a unary call to the `opt` parser
  | opt token a // this is a binary call to the `token` parser, which is not what we want, and is invalid
  ;

Available Combinators

  • many0: Unary. Take a parser and match it zero or more times.
  • many1: Unary. Take a parser and match it one or more times.
  • opt: Unary. Take a parser and match it once. If it fails, report success.
  • then: Binary. Match the parser on the left. If it succeeds, match the one on the right.
  • or: Binary. Match the parser on the left. If it fails, try matching the one on the right.

Shortcuts:

  • * for many0
  • + for many1
  • ? for opt
  • : for token

Putting it all together:

statement
  | :A then ?(:B or :C)
  | +:D
  ;

statements
  | *statement
  ;

The last rule will be considered the starting rule.

Evaluating Code

Rules can optionally evaluate some code:

statement
  | :A then ?(:B or :C)
  |> 'return { type: 'FOO', first: $1, second: $2 }'
  | :D
  ;

Only the outmost call evaluates the code. Binary calls populate two variables: $1 and $2, while unary and rule calls only populate $1.

You can build up complex results from simple ones as such:

name
  | many0 (:A or :B) 'return { type: "NAME", value: $1.join("") }'
  ;

statement
  | name then :C 'return { type: "STATEMENT", name: $1, c: $2 }'
  ;

Named Parameters

To make things easier to manage, you can use the special as combinator, which takes a parser on the left hand side, and an arbitrary name on the hand side. Then it will give you access to a variable of the same name when evaluating code. You can access it using $.<my-name>:

statement
  | (*(:A or :B) as :demo) then :C
  |> 'return { type: "STATEMENT", name: $.demo.join(""), c: $2 }'
  ;

Setting The Evaluation Context

You can also set the context in which the code in grammars gets evaluated by passing a context option:

const result = new Pasukon('grammar.g', {
  context: {
    foo: function () { return 2 }
  }
}).parse('input')

You can then access the context using $ctx:

name
  | :A 'return $ctx.foo($1)'
  ;

The rule above will return 2 if it matches, because foo returns 2. This is useful for building up nodes in an Abstract Syntax Tree.

Note that this wont play nice with caching though. Because the result of each parsing step is memoized, the code is executed only once. This might lead to some unexpected behavior.

Left Recursion

You have to be careful when defining rules that they are not left-recursive. That means, a rule cannot match itself first. It needs to match other things first.

// left recursion infinite loop
start
  | start
  ;

// this is fine
start
  | :a start
  ;

It also applies to rules that call other rules.

// left recursion infinite loop
// because `RULE_B` starts with `RULE_A` and `RULE_A` starts with `RULE_B`
RULE_A
  | RULE_B
  ;

RULE_B
  | RULE_A
  ;

The parser will check for left-recursion and fail if it can find it.

Writing Efficient Parsers

Each rule is executed from left to right, from top to bottom. If a particular branch fails, it retries from the beginning on another branch. Eg:

demo
  | (many0 :a) then :b
  | (many0 :a) then :c
  ;

For the input aaaac, the parser will match four a tokens, then try b. Because it finds a c instead, it goes back all the way, scans four a again, then scans c and succeeds.

A more efficient parser would be:

demo
  | (many0 :a) then (:b or :c)
  ;

Alternatively:

demo
  | (many0 :a) then tail
  ;

tail
  | :b
  | :c
  ;

Yet another alternative is enabling caching. Simply pass { cache: true } to the options and (many0 :a) will be evaluated only once (in the first example).

Note that memoization uses up more memory, and has the extra step of saving/checking each parse step. It's up to you whether you want to use it or not.

Luckly, it's very easy to toggle so once your grammar is done, try enabling/disabling it and see if you get any performance benefits.

Adding Your Own Combinators

const Pasukon = require('pasukon').Pasukon
const parser = new Pasukon('...', {
  combinators: {
    binary: { 'my-combinator-name': MyBinaryCombinator },
    unary: { 'another-name': MyUnaryCombinator }
  }
})

Unary combinators take a single argument, a parser, in the constructor. Binary combinators take an array of two parsers.

They must implement a parse method and return a Result object. This is a demo binary combinator:

class CustomThen extends BinaryCombinator {
  parse (tokens) {
    const lhs = this.parsers[0].parse(tokens)
    if (lhs.failed) return lhs

    const rhs = this.parsers[1].parse(lhs.remaining)
    if (rhs.succeeded) {
      rhs.matched = this.code ? this.eval(lhs.matched, rhs.matched) : [lhs.matched, rhs.matched]
      return rhs
    }

    return this.fail(tokens)
  }
}

And this is a custom unary combinator:

class CustomMany1 extends UnaryCombinator {
  parse (tokens) {
    if (tokens.isEmpty()) return this.fail(tokens)

    let remaining = tokens
    let matched = []
    while (!remaining.isEmpty()) {
      const result = this.parser.parse(remaining)
      if (result.matched !== null) matched = matched.concat(result.matched)

      if (result.failed) {
        if (matched.length === 0) {
          return this.fail(tokens)
        } else {
          return this.ok(remaining, this.code ? this.eval(matched) : matched)
        }
      }

      remaining = result.remaining
    }
  }
}

You can see example combinators in ./lib/parsers/combinators.

Development

Remember to npm install before anything else. If you make changes to ./lib/grammar.pasukon, rebuild the AST with

npm run grammar

Run tests with

npm run test

Watch tests with

npm run watch

To build the browser distribution files, run

npm run build

To run the benchmark suite, run

npm run benchmark

Why

Pasukon is designed to be simple to learn, use and extend. Inspired by the awesome PEGjs, I decided to create a similar parser generator that can handle caching + indent-based grammars by implementing a lexing step.

Pasukon's grammar syntax is quite simplistic, it provides a few rules and everything is built upon that. There's no left-recursion elimination or operator precedence. Instead, you need to define rules in the proper order (like PEG parsers).

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].