Alternatives and detailed information of xtextadd

These pages contain some proposed additions to Xtext.

Xtext is a very powerful way to create a project IDE from a grammar but I would like some additional capabilities. I will put some small demonstrator projects here in the hope of persuading the Xtext team to include these capabilities in Xtext.

These small stand-alone projects, which I would like to see built into Xtext, are:

Whitespaceblock - Adds the ability to define blocks in your DSL using whitespace (indentation).
Pbase - A tutorial for Whitespaceblock
Macro - Adds the ability to have macros in your DSL.
Xgener - A customisable generator for Xtext projects.

Python-like syntax

I have drafted code to support DSLs that need a Python-like syntax.

To see how this works I suggest looking at this code:

An alternative Python-like version of common.Terminals here
When a grammar uses these terminals, the proposal is that this token source is used.

The effect is that when the source code contains indentation like this:

a
	b
		c
	d

Then 'phantom tokens' will be automatically added which will be used by the parser to mark blocks. This is much easier than trying to use the grammar to parse whitespace. So the above input will look to the parser like:

a
	BEGIN b
		BEGIN c
	END d
END
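
As a rough illustration of the idea, here is a minimal Python sketch of a stage that sits between the lexer and the parser and inserts the BEGIN and END phantom tokens. It is not the token source from this project: the function name add_block_tokens is hypothetical, tokens are plain strings split on whitespace, and indentation is assumed to be exactly one tab per level.

```python
def add_block_tokens(source):
    """Return the tokens of `source` with BEGIN/END inserted from tab indentation.

    Hypothetical helper for illustration only; assumes one tab per block level.
    """
    tokens = []
    depth = 0  # number of blocks currently open
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines neither open nor close blocks
        indent = len(line) - len(line.lstrip("\t"))
        while depth < indent:   # deeper indentation opens a block
            tokens.append("BEGIN")
            depth += 1
        while depth > indent:   # shallower indentation closes blocks
            tokens.append("END")
            depth -= 1
        tokens.extend(line.split())
    tokens.extend(["END"] * depth)  # close blocks still open at end of input
    return tokens


sample = "a\n\tb\n\t\tc\n\td\n"
print(add_block_tokens(sample))
# ['a', 'BEGIN', 'b', 'BEGIN', 'c', 'END', 'd', 'END']
```

The output matches the BEGIN/END listing above. A real implementation would also have to decide how to treat spaces, mixed indentation and comments, which this sketch ignores.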

More details:

The code is here.
I have put an enhancement request for this on Eclipse bug tracking system.
I have put a tutorial for converting your DSL to Python-like syntax on this page.

Macro

I would like better support to help DSLs that contain macros.

Note: I had a look at the Xtend implementation of macros and it looks very complicated. I am looking for something very simple that can be customised for different DSLs.

I have drafted an implementation here, which partly works, but there are some issues which I think need some changes to Xtext.

This code runs between the lexer and the parser, which is similar to the Python-like syntax request, so I think it is worth at least considering the two at the same time.

Because it runs at this stage, the implementation of macros is very simple:

It works on the token stream and not the character stream. This means that the macro must contain whole tokens and not partial tokens. So we can't have a macro to substitute the contents of strings, for example.
These macros don't have a scope and will be applied at any point in the source after they are defined.
These macros don't have parameters; each one is just a simple substitution.

So it works like this:

The first line is meant to show a macro definition. It defines a macro 'x' which has the value 'a b c':

macro x a b c endmacro
d x
e x

The code in this project then substitutes every occurrence of 'x' with its value 'a b c':

d a b c
e a b c
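
The substitution can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not the drafted Xtext implementation: the function name expand_macros is hypothetical, and tokens are modelled as plain strings using the 'macro' and 'endmacro' keywords from the example.

```python
def expand_macros(tokens):
    """Expand parameterless macros in a list of string tokens (illustrative only)."""
    macros = {}   # macro name -> list of replacement tokens
    output = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "macro":
            # Definition: macro <name> <body tokens...> endmacro
            name = tokens[i + 1]
            end = tokens.index("endmacro", i + 2)  # ValueError if unterminated
            macros[name] = tokens[i + 2:end]
            i = end + 1   # the definition itself is dropped from the output
        elif tok in macros:
            output.extend(macros[tok])  # whole-token substitution
            i += 1
        else:
            output.append(tok)  # includes uses that precede the definition
            i += 1
    return output


source_tokens = "macro x a b c endmacro d x e x".split()
print(expand_macros(source_tokens))
# ['d', 'a', 'b', 'c', 'e', 'a', 'b', 'c']
```

Because the lookup works on whole tokens, a name that only appears inside a string token is never replaced, and a use that comes before the definition is left alone, as described in the list above.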

Although this is very simple, it is good enough for my application and it may be a good starting point for other people also.

However this is a bit more complicated than the support for Python-like syntax request and this causes some problems. It is more complicated because:

In the Python case only simple PhantomTokens (begin block and end block) are added into the token stream, and these don't need to carry any extra textual information. In the macro case we need to add phantom tokens with textual information. I think this needs changes to Xtext to implement.
We also need to remove tokens from the token stream. I'm not sure of the best way to do this: make them hidden or leave them out altogether? (need to do some tests).

More details:

The code is here.
I have put an enhancement request for this on Eclipse bug tracking system.

Preprocessor

The above projects may need to be used in combination and they may need to be customised to change the syntax slightly to suit the specific DSL. So it would be better if they were part of a more general preprocessor designed to run after the lexer and before the parser. This is for situations where we don't want to write the lexer or parser completely by hand, we still want to use the grammar, but we want more customisation than is currently possible.

Ideally this would have a two-way mapping between the text stream used in the editor and the indexes used in the nodeModel.

Technical Background

Here are some technical notes which might help in understanding the code (I have put more extensive information on this site).

The output of the Xtext parser is two separate tree structures:

EMF model (semantic model) - used for validation and eventually code generation - I think this is the equivalent of the AST.
Node Model - This is read by the Eclipse JFace text editor component to display and enter the text. Leaves in this tree are tokens which point to a chunk of text stream which must be contiguous and non-overlapping with other tokens.

Each token can refer to two separate text values:

A start and end index into the text stream. This value will be used for the NodeModel.
An explicit text value. This will be used for the EMF model.

So each token can have two values. In the macro example, the index for a macro use will point to the macro name in the text stream and the text value will contain the expansion of the macro.
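
To make the two values concrete, here is a small Python sketch. The Token class and its field names are hypothetical and are not ANTLR's CommonToken; the end offset is assumed to be exclusive.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical token carrying both kinds of value (illustration only)."""
    start: int  # offset of the first character in the editor text
    end: int    # offset just past the last character (exclusive, assumed)
    text: str   # explicit text value handed on to the semantic model


def editor_text(source, token):
    """Text the editor shows for this token, taken from the index pair."""
    return source[token.start:token.end]


source = "macro x a b c endmacro\nd x\n"
pos = source.rindex("x")                    # the macro use on the second line
use = Token(start=pos, end=pos + 1, text="a b c")

print(editor_text(source, use))  # x      (what the node model points at)
print(use.text)                  # a b c  (what the EMF model would see)
```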

PhantomToken

There is no explicit mechanism for tokens which need to be used in the parser and will affect the EMF model, but do not exist anywhere in the editor, such as the inserted BEGIN and END tokens in the Python-like example above.

However, we can cheat by making the start and stop indexes the same, which means that the token has little effect on the NodeModel. It is still important that the index values are contiguous with the tokens before and after it.
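
The contiguity requirement can be expressed as a simple check. This is a sketch under my own assumptions: spans are (start, end) pairs with an exclusive end, so a phantom token is a pair whose two values are equal, and the function name is_contiguous is hypothetical.

```python
def is_contiguous(spans):
    """True if every span starts exactly where the previous one ended.

    spans: list of (start, end) pairs in source order, end exclusive (assumed).
    """
    return all(prev_end == next_start
               for (_, prev_end), (next_start, _) in zip(spans, spans[1:]))


# Input "a\n\tb": token a, the whitespace, a zero-width phantom BEGIN, token b.
good = [(0, 1), (1, 3), (3, 3), (3, 4)]
bad = [(0, 1), (1, 3), (0, 0), (3, 4)]   # phantom token placed at offset 0

print(is_contiguous(good))  # True
print(is_contiguous(bad))   # False
```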

Regarding the node model and indexes into the character stream: there is a lot of scope for confusion because the node model (Xtext code) and CommonToken (ANTLR code) use different conventions for these indexes. Neither is properly documented, and to work on this code you need to understand both.

With all the potential for confusion (well it definitely confused me) I think this needs more explanation. My understanding is:

Stop index for CommonToken is the zero-based index of the last character in this token.
Stop index for the node model is the zero-based index of the first character in the next token. Internally it is stored as start and length, but a getEndOffset() method is provided.
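
If my understanding above is right, converting between the two conventions is just an off-by-one adjustment. The Python sketch below uses plain integers rather than the real ANTLR or Xtext classes, and both function names are hypothetical.

```python
def antlr_to_node(start, stop):
    """(start, inclusive stop) -> (offset, length, end offset) in node-model style."""
    length = stop - start + 1
    return start, length, start + length


def node_to_antlr(offset, length):
    """(offset, length) -> (start, inclusive stop) in CommonToken style."""
    return offset, offset + length - 1


# A three-character token that begins at offset 4 of the text stream:
print(antlr_to_node(4, 6))   # (4, 3, 7)
print(node_to_antlr(4, 3))   # (4, 6)
```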

For me, it is easiest to think of the indexes into the text stream as representing the spaces between the characters, not the characters themselves. It is then very clear how compound nodes are calculated:

Index:	0		1		2		3		4		5		6		7
text stream:		{		{		{		a		}		}		}

So the first character has index 0:1, the second 1:2, and so on.

This makes it easier to work out the indexes for composite nodes as well as leaf nodes. So, for example, the composite node holding the outer brackets is 0:7. The inner brackets are 2:5.
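
With indexes between the characters, a composite node simply runs from the start of its first child to the end of its last child. A minimal Python sketch of that calculation, using plain (start, end) pairs and a hypothetical helper name composite_span:

```python
def composite_span(children):
    """Span of a composite node from its children.

    children: non-empty list of (start, end) pairs in source order,
    with indexes counted between characters as in the table above.
    """
    return children[0][0], children[-1][1]


text = "{{{a}}}"
leaves = [(i, i + 1) for i in range(len(text))]  # one leaf per character

outer = composite_span(leaves)        # all seven characters
inner = composite_span(leaves[2:5])   # the innermost brackets and 'a'

print(outer, text[outer[0]:outer[1]])  # (0, 7) {{{a}}}
print(inner, text[inner[0]:inner[1]])  # (2, 5) {a}
```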

Xgener

More information about Xgener on this page.

I often find that the DSLs (Domain Specific Languages) that I write have similar constructs, so I end up writing similar grammar rules and other code elements. It would be good to use Xbase but that is often not flexible enough to do what I want.

I need something in between: more flexible than Xbase, but easier than writing a full Xtext grammar from scratch. That is why I have started to write Xgener.

Xgener is intended for languages that are fairly conventional in that they still have concepts like:

class
procedure/method
statement
expression

However, it will allow certain modifications to these concepts to give the flexibility in the DSLs I need.