Thursday, September 29, 2011

ANTLR Tutorial - Expression Language

The ANTLR tool is useful any time you need to create a compiler, interpreter, or parser for your own language. It takes a so-called grammar file as input and generates two classes: a lexer and a parser.

This post explains what a lexer is, what a parser is, and how to write grammar files to generate them. By the end of the post, you will be able to create a compiler into an abstract syntax tree for any simple programming language.

Our example project, a boolean expression language, is written in Java and available on Github. Apart from that, everything explained in this post is language independent: grammar files are the same in all languages.

This is our second post on the topic. The first post showed how to use ANTLR in a maven project, how to add error handling to it, and how to create a 'Hello World' language parser in Java. The posts are independent - you do not have to read the first one to understand this one.


Overview

The overview is taken from the previous article; skip it if you have already read it. ANTLR generates two classes from the grammar file: a lexer and a parser.

The lexer runs first and splits the input into pieces called tokens. Each token represents a more or less meaningful piece of the input. The stream of tokens is passed to the parser, which does all the necessary work. It is the parser that builds the abstract syntax tree, interprets the code, or translates it into some other form.

The grammar file contains everything ANTLR needs to generate a correct lexer and parser: whether it should generate Java or Python classes, whether the parser builds an abstract syntax tree, emits assembler code, directly interprets the code, and so on. As this tutorial shows how to build an abstract syntax tree, we will ignore the other options in the following explanations.

Most importantly, the grammar file describes how to split the input into tokens and how to build a tree from those tokens. In other words, the grammar file contains lexer rules and parser rules.

Lexer - Basics

The lexer takes a character stream as input and splits it into a stream of tokens. Lexer rules define those tokens; each rule represents one token. A lexer rule always starts with the token name, which must begin with an uppercase letter, followed by the regular expression that describes it:
TokenName: regular expression;   

The regular expression grammar is described in the first part of this chapter. The second part explains how the lexer works. We won't go into details; this chapter contains only the necessary minimum needed to understand and use ANTLR. Finally, we show how to reference lexer rules from other rules and how to get rid of white spaces.

Regular Expressions
The ANTLR lexer uses a standard regular expression grammar. Character sequences must be enclosed in single quotes. For example, this lexer rule matches the salutation 'Hello':
SALUTATION: 'Hello';   

Use '|' to express alternatives. A lexer rule matching either the salutation 'Hello' or 'Bye':
HELLO_OR_BYE: 'Hello' | 'Bye';   

The dot '.' represents any character:
ANY_CHARACTER: .;   

Use parentheses whenever you feel like it. The following expressions are equivalent to the previous ones:
SALUTATION: ('Hello');
HELLO_OR_BYE: (('Hello') | ('Bye'));
ANY_CHARACTER: (.); 

Use '..' to express an interval. A lexer rule matching any single digit from 0 to 9:
NUMBER: '0'..'9';   

Use '+' to express 'repeated one or more times'. A lexer rule matching any integer:
NUMBER: ('0'..'9')+;   

Of course, you can combine expressions. A lexer rule matching any integer without leading zeros:
NUMBER_NO_LEADING_ZEROS: ('0'..'9') | (('1'..'9')('0'..'9')+);
Read it as: either a one-digit number, or a number with at least two digits that does not start with 0.

Use '*' to express 'repeated zero or more times'. A lexer rule matching any sequence of characters (i.e. creating a token from the whole input):
EVERYTHING: .*;   

Use '?' to express 'repeated zero or one time'. A lexer rule matching one or two digits:
ONE_OR_TWO_DIGITS: ('0'..'9')('0'..'9')?;   

Use '~' to express the inverse of a set. A lexer rule matching anything except a digit:
NOT_A_DIGIT: ~('0'..'9');   

How Lexer Works
If we ignore the details, the lexer's algorithm is very simple. First, the lexer selects the rule to be used. Second, it matches the selected rule against the input and outputs the corresponding token. The matched beginning of the character stream is chopped off and the lexer continues with the first step on the rest of the input.

A few things are important:
  • The lexer looks for the first longest matching rule.
  • The matching is greedy - the selected token uses as many characters as possible.
  • The lexer never changes its previous decisions.
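The three points above can be illustrated with a short sketch. This is not how generated ANTLR lexers actually work internally (they are generated state machines that decide on a token without fully matching every rule); it is only a language-independent model of the 'first longest match' loop, written here in Python with hypothetical rule/regex pairs:

```python
import re

# Illustrative model of the lexer loop: at each position, pick the FIRST
# rule with the LONGEST match, emit the token, and move forward.
def tokenize(rules, text):
    """rules: list of (token_name, regex) pairs, in grammar order."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.match(pattern, text[pos:])
            # strict '>' means an earlier rule wins ties of equal length
            if m and len(m.group(0)) > best_len:
                best_name, best_len = name, len(m.group(0))
        if best_name is None:
            raise ValueError(f"no viable alternative at character {text[pos]!r}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len  # the decision is final; the lexer never backtracks
    return tokens

rules = [("SHORTTOKEN", "abc"), ("LONGTOKEN", "abcabc")]
print(tokenize(rules, "abcabc"))  # [('LONGTOKEN', 'abcabc')]
```

Running the sketch on abcabcabc also reproduces the decision finality described below: the lexer emits LONGTOKEN for the first six characters and SHORTTOKEN for the rest.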

First Longest Matching Rule
The lexer looks for the first longest matching rule. In other words, if multiple rules match the input, the longest one is selected. For example, both SHORTTOKEN and LONGTOKEN in the grammar
SHORTTOKEN: 'abc';
LONGTOKEN: 'abcabc';
match the beginning of the input stream abcabc. As LONGTOKEN matches six characters and SHORTTOKEN matches only three, the lexer selects the longer LONGTOKEN.

If multiple rules match the same length of input, the first one is selected. The stream abc and the grammar:
SHORTTOKEN: 'a';
FIRSTTOKEN: 'abc';
SAMELENGTHTOKEN: 'ab.';
would result in a FIRSTTOKEN output stream, because FIRSTTOKEN is listed before SAMELENGTHTOKEN.

Greedy Token Matching
The second step matches the selected token against the beginning of the input. The matching is greedy - the selected token uses as many characters as possible. As a result, the following token always matches the whole input:
WHOLE_INPUT: .*;

Greediness may lead to unexpected errors. Consider the grammar:
ENDING: 'bc';
REPEAT_ABC: ('abc')* 'a';
and the stream abcabcabc. The beginning clearly matches the token REPEAT_ABC. The greedy algorithm matches all available abc blocks, including the last one. As the token REPEAT_ABC must end with the letter a, the lexer reports an error:
line 1:9 mismatched character '<EOF>' expecting 'a'

Decision Finality
Once the lexer has matched a token, it puts it into the output stream and moves forward in the input stream. It never changes a previous decision. Theoretically, the grammar:
SHORT: 'aaa';
LONG: 'aaaa';
could split the input aaaaaa into the token stream SHORT SHORT. However, the lexer chooses and matches the longest token LONG. As the remaining input aa does not match any token, the lexer reports two errors:
line 1:4 no viable alternative at character 'a'
line 1:5 no viable alternative at character 'a'

Common Errors
The lexer chooses the next token without fully matching it against the input. Once it has eliminated all shorter rules, it stops looking ahead and chooses the remaining longest token. If the selected token does not match the input, the second step reports an error.

For example, take the input abcabQ and the grammar:
SHORTTOKEN: 'abc';
LONGTOKEN: 'abcabc';

The token SHORTTOKEN matches 3 letters, while the token LONGTOKEN matches more than 4 letters of the input. Therefore, the first step chooses the token LONGTOKEN. Unfortunately, LONGTOKEN does not fully match the input:
line 1:5 mismatched character 'Q' expecting 'c'

If no token matches the beginning of the input at all, a different error is reported:
line 1:0 no viable alternative at character 'Q'

Fragments
Once you start creating real lexer rules, you will find that some of them tend to be complicated and difficult to read.

To solve the problem, ANTLR provides a way to create helper rules. Helper rules do not represent tokens; they exist only to be referenced by other lexer rules. All helper rules are marked with the keyword fragment.

For example, we may rewrite the previous rule:
NUMBER_NO_LEADING_ZEROS: ('0'..'9') | (('1'..'9')('0'..'9')+);   

into a cleaner form:
fragment DIGIT: ('0'..'9');
fragment DIGIT_NOT_ZERO: ('1'..'9');
NUMBER_NO_LEADING_ZEROS: DIGIT | (DIGIT_NOT_ZERO DIGIT+);   

Important: lexer rules can reference only fragment rules. E.g. the following is wrong:
//non-fragment rule
SALUTATION: 'Hello';   
//Incorrect rule - ANTLR fails. It references non-fragment rule.
HELLO_OR_BYE: SALUTATION | 'Bye';   

White spaces
Most languages ignores all spaces, tabulators, end of line and similar characters. If you wish to do the same, add following into your grammar:
WS : ( ' ' | '\t' | '\r' | '\n' )+ { $channel = HIDDEN; }


Parser - Basics

On the surface, parser rules seem very similar to lexer rules - they use the same regular expression grammar. Under the surface, they are very different. More importantly, they are used in a different way.

While the lexer splits the input into tokens, the parser translates the token stream into some other structure. In our case, that structure is an abstract syntax tree, and each parser rule represents a small sub-tree of the final tree.

The first part of this chapter lists the differences between the lexer's and the parser's expression grammars. The second part explains how the parser works. The rest of the chapter shows typical problems you may run into.

Parser Rules
Both the lexer and the parser use the standard regular expression grammar to match their input. To distinguish lexer rules from parser rules in the same file, each parser rule name must begin with a lowercase letter:
ruleName: parser expression;

Any Token
As the parser works on the token stream, the meaning of the dot '.' changes: inside a parser rule, the dot represents any token. The following rule always matches the whole token stream:
//match all tokens in the stream
everything: .*;   

Referencing
The rules for what a parser rule may reference are different too:
  • A parser rule can never reference a helper lexer rule (those marked with the keyword fragment).
  • A parser rule can reference any non-fragment lexer rule.
  • A parser rule can reference any other parser rule.

Inline Tokens
You can use character sequences directly inside parser rules; ANTLR automatically creates the corresponding tokens:
//ANTLR automatically creates token from the ',' comma
list: LISTNAME LISTMEMBER (',' LISTMEMBER)*;   

Abstract Syntax Tree
Each parser rule corresponds to a small part of the final abstract syntax tree. By default, each token represents one node and all nodes are attached to the root node. The root node is a dummy node - it does not correspond to any token.

For example, the input lownumbers 2, 3, 4 parsed with the grammar:
LISTNAME: ('a'..'z')+;
LISTMEMBER: ('1'..'9')+;
//remove whitespaces from input
WS :  ( ' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}  ;
//ANTLR automatically creates token from the ',' comma
list: LISTNAME LISTMEMBER (',' LISTMEMBER)*;   

leads to the following abstract syntax tree:
null
-- lownumbers
-- 2
-- ,
-- 3
-- ,
-- 4

To suppress useless tokens in the tree, put the '!' symbol after the corresponding tokens:
//remove ',' comma from the output
list: LISTNAME LISTMEMBER (','! LISTMEMBER)*;   

Put the '^' symbol after the token you wish to become the parent node:
//remove ',' comma from the output; LISTNAME is the root
list: LISTNAME^ LISTMEMBER (','! LISTMEMBER)*;   

The new abstract syntax tree has a much better structure than the previous one:
lownumbers
-- 2
-- 3
-- 4

How Parser Works
The parser starts knowing which rule should correspond to the input and then tries to match that rule against the input stream. Matching always starts from the leftmost element of the rule and continues to the right.

A parser invoked with the call:
//parse expression out of the token stream
GeneratedParser.expression_return ret = parser.expression();

assumes that the input matches the 'expression' rule. Read the call as "parse the input token stream assuming that it represents an expression" or "match the expression rule against the input token stream".

Unless configured otherwise, parser rules are expected to be unambiguous. No matter the input, the parser should never be forced to guess or backtrack. There must be only one way to match the input against the rule.

If the rule starts with a token, the parser compares it to the first token in the input token stream. If they are different, an error is reported. Otherwise, a mapping is created and the parser continues with the next rule element. For example, the parser for the 'Hello World' grammar:
expression : SALUTATION ENDSYMBOL;
would validate whether the input token stream contains a SALUTATION token followed by an ENDSYMBOL token.

If the rule starts with a reference to another parser rule, the parser first matches the referenced rule against the beginning of the input. The parser for the grammar:
expression : otherexpression TOKEN;
otherexpression: ... something ...
matches otherexpression against the input beginning first. Once that matching is done, the parser validates whether the rest of the input begins with the token TOKEN.

If it encounters a rule with multiple alternatives:
expression : somekindofexpression | differentexpression;
somekindofexpression: SALUTATION NAME WELCOME;
differentexpression: SALUTATION WELCOME;
the parser first decides between the 'somekindofexpression' and 'differentexpression' rules. To decide which rule should be used, it investigates (looks ahead at) the incoming token stream. In this case, the parser reads the first two tokens and, depending on the second one, decides which alternative to use.

If no alternative matches the input token stream, the parser reports an error. An attempt to parse the expression "Tree !" with the previous grammar would lead to the following error:
line 1:0 no viable alternative at input 'Tree '
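As an illustration only, the look-ahead decision for the grammar above could be modelled like this (a Python sketch with hypothetical token lists; the generated parser implements the same decision as a state machine):

```python
# Sketch of the LL(2) decision for:
#   expression : somekindofexpression | differentexpression;
#   somekindofexpression: SALUTATION NAME WELCOME;
#   differentexpression: SALUTATION WELCOME;
# The parser peeks at (but does not consume) the first two tokens.
def choose_alternative(tokens):
    if tokens[0] != "SALUTATION":
        raise ValueError(f"no viable alternative at input {tokens[0]!r}")
    # both alternatives start with SALUTATION; the second token decides
    if tokens[1] == "NAME":
        return "somekindofexpression"   # SALUTATION NAME WELCOME
    return "differentexpression"        # SALUTATION WELCOME

print(choose_alternative(["SALUTATION", "NAME", "WELCOME"]))  # somekindofexpression
print(choose_alternative(["SALUTATION", "WELCOME"]))          # differentexpression
```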

This algorithm has several consequences, which are discussed in the following chapters.

Independence of Lexer and Parser
It is important to remember that the lexer and the parser run independently. First, the lexer splits the input into tokens; then the parser arranges them into some other form. The lexer has no knowledge of the parser rules.

Theoretically, it would be possible to split the input Hello Hello ! with the grammar:
ENDSYMBOL: '!';
SALUTATION: 'Hello ';
NAME: ('a'..'z'|'A'..'Z')*;
expression : SALUTATION NAME ENDSYMBOL;
into the parseable token stream SALUTATION NAME ENDSYMBOL, where the name would be 'Hello '.

However, ANTLR will not do that. Instead, the lexer splits the input into two SALUTATION tokens followed by one ENDSYMBOL token. As the token stream SALUTATION SALUTATION ENDSYMBOL makes no sense to the parser, it reports an error:
line 1:6 mismatched input 'Hello ' expecting NAME

Start Rule
Any grammar needs a so-called start rule: a rule that is not referenced by any other rule. If your grammar does not have such a rule, the ANTLR generator issues a warning:
no start rule (no rule can obviously be followed by EOF)

To avoid it, add a dummy start rule to your grammar:
start_rule: someOtherRule;   

Multiple Alternatives
Some grammars tend to be unclear:
SALUTATION:'Hello ';   
NAME:('A'..'Z')('a'..'z')*;

startRule: knowperson | unknowperson;
knowperson: SALUTATION NAME*;  
unknowperson: SALUTATION;

The input Hello parsed by the previous grammar could match both the knowperson and the unknowperson alternatives. In such a case, the build fails and the ANTLR maven plugin throws an error:
warning(200): Grammair.g:27:10: Decision can match input such as "SALUTATION" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input

error(201): Grammair.g:27:10: The following alternatives can never be matched: 2

ANTLR warns us that the rule startRule is unclear. The token stream SALUTATION, e.g. the input Hello , could be interpreted in two different ways: as an unknowperson salutation or as a knowperson salutation followed by zero names.

The parser decided to always use the first alternative knowperson. As a result, the alternative unknowperson will never be used.

Ambiguity
The matching between a token stream and a parser rule should be unambiguous. If a token stream matches a rule, there should be only one way to match them. This requirement is very similar to the previous one.

The rule expression in the 'Hello World' grammar:
SALUTATION:'Hello word';   
ENDSYMBOL:'!';

expression : SALUTATION ENDSYMBOL;
is unambiguous. Only the token stream SALUTATION ENDSYMBOL matches the rule and there is only one way to map it: the token SALUTATION maps to the SALUTATION part of the rule and the token ENDSYMBOL maps to the ENDSYMBOL part.

On the other hand, the following grammar is ambiguous:
TOKEN:'a' | 'b';   
expression : TOKEN* TOKEN TOKEN*;

The expression ba leads to the token stream TOKEN TOKEN where the first token corresponds to the letter 'b' and the second one corresponds to the letter 'a'.

The grammar allows two possible mappings between the stream and the rule:
  • The first token 'b' could correspond to the first TOKEN* part. The second token 'a' then corresponds to the middle TOKEN element of the rule.
  • The first token 'b' could correspond to the middle TOKEN element. The second token 'a' then corresponds to the final TOKEN* part of the rule.

Unambiguity is not a hard requirement. ANTLR is able to create both a parser and a lexer from the previous grammar. However, it reports a warning to the console:
warning(200): Decision can match input such as "TOKEN TOKEN" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input

You cannot avoid the issue by simple tricks such as:
TOKEN:'a' | 'b';   
expression: trick TOKEN TOKEN*;
trick: TOKEN*;

The grammar is still ambiguous and ANTLR would complain.

Left-Recursion
You have to avoid something called left-recursion. Left recursion happens when a rule references itself in its leftmost part.

Imagine a grammar for lists of letters. Any list is either a single letter (e.g. f) or a list of letters followed by the symbol ',' and another letter (e.g. m, f, b). We may try the following rule to describe the list:
//ANTLR fails, rule breaks no left-recursion rule
letterslist: LETTER  | letterslist ',' LETTER;   

The above rule defines letterslist as letterslist followed by something else, which breaks the no-left-recursion rule. ANTLR complains:
error(210):  The following sets of rules are mutually left-recursive [letterslist]

If you try to trick it and split the rule into two:
//ANTLR fails, rule breaks no left-recursion rule
letterslist: LETTER  | longletterslist;   
longletterslist: letterslist ',' LETTER;  

ANTLR still complains. The rule still references itself in its leftmost part; it just uses another rule to do so. Such rules are called "mutually left-recursive":
error(210):  The following sets of rules are mutually left-recursive [letterslist, longletterslist]

The correct solution is to rewrite the rule into an equivalent form:
//finally works
letterslist: LETTER (',' LETTER)*;   

The ANTLR website contains a longer article on the topic.

Non-LL(*) Decision Error
We continue with another example of a faulty grammar. It should recognize inputs for a simple calculator:
NUMBER: ('0'..'9')+;

startRule: plus;
plus: expression | expression '+' expression;
expression : NUMBER | '(' plus ')';   

The build fails and the console output contains an error:
error(211): ParserExperiments.g:45:5: [fatal] rule plus has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2.  Resolve by left-factoring or using syntactic predicates or using backtrack=true option.

Line 45 contains the offending rule:
plus: expression | expression '+' expression;   

To understand the problem, recall that ANTLR goes from left to right when parsing input. It first decides which alternative it will use, then sticks with that decision. The decision is based solely on the tokens in the upcoming stream.

Try to simulate it on a complicated input (1+(7+(9+(0))+(5+4))+(2)). Does it match the expression alternative, or the expression '+' expression alternative? The only way to decide is to parse the whole input and see how the parentheses match.

By default, ANTLR will not generate a parser that does that. The error message suggests three possible solutions:

Left Factoring
Rewrite the plus rule from the previous grammar into another form: take the common leftmost part and move it before parentheses. The offending rule
plus: expression | expression '+' expression;   

is equivalent to the rule
plus: expression (| '+' expression);

The new rule does not have the previous problem. ANTLR now knows that the input token stream starts with an expression. Once it has finished parsing it, ANTLR checks the upcoming stream to see whether it contains a '+' token or nothing. Nothing ends the rule; a '+' token triggers the parsing of the next expression.

The working grammar looks like this:
NUMBER: ('0'..'9')+;

startRule: plus;
plus: expression (| '+' expression);
expression: NUMBER | '(' plus ')';
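To see why the left-factored rule is easy to parse, here is a hypothetical recursive-descent sketch of the working grammar (Python, illustration only; ANTLR generates equivalent code for you). Exactly like the grammar, parse_plus consumes at most one '+' per invocation; nested sums come from the parenthesized alternative:

```python
# Recursive-descent sketch of:
#   plus: expression (| '+' expression);
#   expression: NUMBER | '(' plus ')';
# Tokens are plain strings; NUMBER tokens are digit strings.
def parse_plus(tokens, pos=0):
    value, pos = parse_expression(tokens, pos)
    # the left-factored decision: peek at one upcoming token only
    if pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_expression(tokens, pos + 1)
        value = ("+", value, right)
    return value, pos

def parse_expression(tokens, pos):
    if tokens[pos] == "(":
        value, pos = parse_plus(tokens, pos + 1)
        if tokens[pos] != ")":
            raise ValueError("mismatched input, expecting ')'")
        return value, pos + 1
    return tokens[pos], pos + 1  # a NUMBER token

tree, _ = parse_plus(["1", "+", "(", "2", "+", "3", ")"])
print(tree)  # ('+', '1', ('+', '2', '3'))
```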

As the non-LL(*) decision may be spread across multiple complicated rules, left factoring can be surprisingly difficult. However, you should use it whenever possible.

Syntactic Predicates
Syntactic predicates are parser conditions. Use them to help the parser decide which alternative it should take:
rule: (condition_rule1) => alternative1 | (condition_rule2) => alternative2 | ... ;

The parser uses the conditions to decide between rule alternatives. It always starts with the leftmost condition. If the incoming token stream matches condition_rule1, the parser uses alternative1. If the first condition does not match, the parser tries the second one.

We can use a syntactic predicate to solve the problem in our last grammar:
NUMBER: ('0'..'9')+;

startRule: plus;
plus: (expression '+' expression) => expression '+' expression | expression;   
expression : NUMBER | '(' plus ')';   

The new parser always checks the expression '+' expression alternative first. If it is possible to parse it out of the stream, it goes for it. Otherwise, it assumes the expression alternative.

Keep in mind that this technique may slow down parsing. The parser may be forced to pass through the whole input multiple times. Under the right circumstances, which are met in our example, it turns a fast linear algorithm into a slow exponential one.

Backtracking
As we wrote before, ANTLR will not go back and change what it has already parsed unless you explicitly allow it. To allow it, add backtrack=true to your grammar options:
options
{
  ...

  //turn on backtracking
  backtrack=true;
}

At each decision point, ANTLR tries each alternative from left to right and uses the first one that matches. In other words, the backtrack=true option automatically adds a condition (syntactic predicate) wherever needed. Unfortunately, this may require multiple parse attempts over the whole input.

As this option can turn a fast linear algorithm into a slow exponential one, it should be used rarely.

More on the topic is available on the ANTLR page.

Boolean Expression Language

We are finally ready to create the first boolean grammar. It has three operators:
  • && - logical and,
  • || - logical or,
  • ! - logical not.

Any expression can be encapsulated in parentheses ( ). For example, this is a correct boolean expression:
bwahaha && (winner || !looser)

First Version
First, we create tokens for each operator and for the parentheses:
LPAREN : '(' ;
RPAREN : ')' ;
AND : '&&';
OR : '||';
NOT : '!';

Then we need a token for names (e.g. bwahaha, winner, looser). Names are composed of letters and digits:
NAME : ('a'..'z' | '0'..'9')+;

The last lexer rule deals with white spaces:
WS : ( ' ' | '\t' | '\r' | '\n' )+ { $channel = HIDDEN; };

Finally, we create the parser rules. The grammar is explained in the rule comments:
//start rule
expression : andexpression;
//first, we try to match all first level && (e.g. && not included in some sub-expression) 
andexpression : orexpression (AND orexpression)*;
//second, we try to match all first level || (e.g. || not included in some sub-expression) 
orexpression : notexpression (OR notexpression)*;
//third, there may or may not be first level ! in front of an expression
notexpression : NOT atom | atom;
//finally, we found either name, or a sub-expression
atom : NAME | LPAREN andexpression RPAREN;

This is the simplest possible grammar that avoids the parser rule problems explained in the previous chapter.

The final grammar file is available on Github. If you wish to see it in action, the compiler class and a unit test are available too. The full project is downloadable under the 004-S005SimpleBoolean tag.

Basic AST Modifications
While the previous grammar parses all needed expressions, the resulting tree leaves a lot to be desired. For example, the expression bwahaha && (winner || !looser) is compiled into the following abstract syntax tree:
null
-- bwahaha
-- &&
-- (
-- winner
-- ||
-- !
-- looser
-- )

We do not like three things about it:
  • useless parenthesis nodes,
  • dummy root node,
  • operators (&&, ||, !) should be parents of sub-trees.

Each parser rule should correspond to a sub-tree with the operator on top and the sub-expressions as children. Parentheses nodes are useless; the expression structure should be clear from the tree shape. We put '^' behind the tokens we wish to be on top and '!' behind the useless ones:
//start rule
expression : andexpression;
//first, we try to match all first level && (e.g. && not included in some sub-expression) 
andexpression : orexpression (AND^ orexpression)*;
//second, we try to match all first level || (e.g. || not included in some sub-expression) 
orexpression : notexpression (OR^ notexpression)*;
//third, there may or may not be first level ! in front of an expression
notexpression : NOT^ atom | atom;
//finally, we found either name, or a sub-expression
atom : NAME | LPAREN! andexpression RPAREN!;

The above grammar, its compiler, and its test are available on Github under the 005-S006SimpleBoolean tag.

Let's parse the previous expression bwahaha && (winner || !looser) again:
&&
-- bwahaha
-- ||
---- winner
---- !
------ looser
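With this structure, interpreting an expression becomes a straightforward tree walk. A minimal sketch, assuming the AST is modelled as nested tuples rather than ANTLR Tree objects (the real project walks the generated tree in Java):

```python
# Illustrative evaluator for the tree shape shown above.
# Trees are nested tuples: ("&&", left, right), ("||", left, right),
# ("!", child), or a plain variable name; env maps names to booleans.
def evaluate(node, env):
    if isinstance(node, str):
        return env[node]
    op = node[0]
    if op == "&&":
        return evaluate(node[1], env) and evaluate(node[2], env)
    if op == "||":
        return evaluate(node[1], env) or evaluate(node[2], env)
    if op == "!":
        return not evaluate(node[1], env)
    raise ValueError(f"unknown operator {op!r}")

# bwahaha && (winner || !looser)
tree = ("&&", "bwahaha", ("||", "winner", ("!", "looser")))
print(evaluate(tree, {"bwahaha": True, "winner": False, "looser": True}))  # False
```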

Add Multi-Parameter Functions
Before we proceed to the next chapter, which explains how to make major changes to the output tree, we add built-in functions to our grammar. The new grammar contains four shortcut functions:
  • all(p1, p2, ..., pn) equivalent to p1 && p2 && ... && pn,
  • atleastone(p1, p2, ..., pn) equivalent to p1 || p2 || ... || pn,
  • neither(p1, p2, ..., pn) equivalent to !p1 && !p2 && ... && !pn,
  • notall(p1, p2, ..., pn) equivalent to !p1 || !p2 || ... || !pn.

We have to create a new token for each function:
ALL : 'all';
ATLEASTONE : 'atleastone';
NEITHER : 'neither';
NOTALL : 'notall';

Placement is important. We must place the new tokens before the NAME lexer rule. If they were after it, the lexer would never recognize them:
error(208): S007BooleanWithFunctions.g:59:1: The following token definitions can never be matched because prior tokens match the same input: ALL,ATLEASTONE,NEITHER,NOTALL

The atom is now either a function, a name, or a parenthesized expression:
atom : function | NAME | LPAREN! andexpression RPAREN!;

Any function is composed of its name and its arguments in parentheses. The function name should be the root and the arguments its children:
function : functionname^ arguments;
functionname : ALL | ATLEASTONE | NEITHER | NOTALL; 

Any expression is allowed as an argument:
arguments : LPAREN! andexpression (','! andexpression)* RPAREN!;

Any interpreter of our new grammar has to deal with four new tokens in the abstract syntax tree. For example, the expression all(notall(p1, p2 && p3), neither(p2,p3, atleastone(p2,p4))) compiles into the following tree:
all
-- notall
---- p1
---- &&
------ p2
------ p3
-- neither
---- p2
---- p3
---- atleastone
------ p2
------ p4


These new tokens are only shortcuts; they provide no new functionality. It would be much better if the parser created an equivalent tree structure containing only &&, ||, ! and NAME tokens.

The grammar, its compiler, and its test are available on Github under the 006-S007BooleanWithFunctions tag.


Rewrite Rules

The default abstract syntax tree generated by ANTLR has a very unfriendly structure: one dummy root node with all recognized tokens as children. We have been able to suppress nodes and specify roots of sub-trees with the operators '^' and '!'. Those are great, but very limited.

Rewrite rules are the last ANTLR feature discussed in this post. They allow major changes to the generated abstract syntax tree. The general form is simple:
ruleName: parser expression -> rewrite rule;   

The rewrite rule part behind the -> specifies exactly how the final abstract syntax tree should look. Rewrite rules can customize the tree structure, add new tokens to it, or suppress some tokens.

Tree Structure
Use the ^(...) syntax to specify the desired tree structure. The parentheses contain tokens and parser rules referenced in the regular expression before the ->.
rule: some TOKEN and other Elements -> ^(TOKEN and other DUMMY Elements);   

The parser takes everything in the parentheses and creates a sub-tree from it. The first element in the parentheses must be a token; it becomes the root of the created sub-tree. All following elements are its children and sub-sub-trees:
TOKEN 
-- and 
-- other 
-- DUMMY   
-- Elements  

The ^(...) parentheses do not have to contain all elements matched in the parser expression. Omitted elements are left out of the result tree. The referenced parser rule some in the above example is missing from the final tree.

The ^(...) may contain dummy tokens - tokens missing from the left part of the rule. The example tree contains a child node with the dummy token DUMMY.

Simple Example
The rule for the binary && operator:
binaryAnd: expression AND^ expression;   
can be rewritten into an equivalent rule with ^(...):
binaryAnd: expression AND expression -> ^(AND expression expression);   

Nested Example
Of course, you can nest ^(...). For example, this slightly longer rule:
binaryAnd: expression AND^ expression AND^ expression;
is equivalent to the following rule:
binaryAnd: expression AND expression AND expression -> ^(AND expression ^(AND expression expression));   

ANTLR keeps the order of elements: the first expression element in the left part corresponds to the first expression element after the ->, the second to the second, and so on.

The input bwahaha && something && else results in the following tree:
&&
-- bwahaha
-- &&
---- something  
---- else  

Labels
You can assign labels to the elements on the left and use those labels in the rewrite rule. Labels are particularly useful if you need to do complicated transformations.

Reordering
The input bwahaha && something && else parsed with the following rule:
binaryAnd: a=expression AND b=expression AND c=expression -> ^(AND $c ^(AND $b $a));
results in a reordered tree:
&&
-- else  
-- &&
---- something  
---- bwahaha

Multiple Elements
Use += to assign the same label to multiple elements. The label then contains a list of elements. When referenced, labels are used in the same order as they appeared in the input. The following rules are equivalent:
binaryAnd: expression AND expression AND expression -> ^(AND expression expression expression);   

binaryAnd: a+=expression AND a+=expression AND a+=expression -> ^(AND $a $a $a);  

The input bwahaha && something && else parsed with the previous two rules leads to a flat tree:
&&
-- bwahaha
-- something  
-- else  

Flush List
It is possible to flush out all list elements with the operators + and *. The next two rules are equivalent to the previous two:
binaryAnd: a+=expression AND a+=expression AND a+=expression -> ^(AND $a+);
binaryAnd: a+=expression AND a+=expression AND a+=expression -> ^(AND $a*);

The input bwahaha && something && else still leads to the same flat tree:
&&
-- bwahaha
-- something  
-- else  

Of course, the label can contain any number of elements. The previous rule handles exactly two ANDs in a row. We can change it to handle one or more ANDs:
//flat tree from at least two expressions
binaryAnd: a+=expression (AND a+=expression)+ -> ^(AND $a*);   

The rule parses any number of ANDs, but the tree for bwahaha && something && else does not change:
&&
-- bwahaha
-- something  
-- else  

The difference between + and * operators is simple: * is able to handle an empty list while + throws a RewriteEarlyExitException exception in that case.
//incorrect, fails on a one-member list such as 'hello'
list:  b=expression (',' a+=expression)* -> ^($b $a+);      
//correct expression
list:  b=expression (',' a+=expression)* -> ^($b $a*);      

Repeat Structures
The plus + and asterisk * operators do not have to be directly behind the list label. It is possible to repeat any structure surrounding the repeated list label.

For example, you can use it to add a dummy token on top of each sub-tree from the previous rule:
binaryAnd: a+=expression (AND a+=expression)+ -> ^(AND ^(DUMMY $a)+);
binaryAnd: a+=expression (AND a+=expression)+ -> ^(AND ^(DUMMY $a)*);

This is the resulting tree:
&&
-- DUMMY
---- bwahaha
-- DUMMY
---- something  
-- DUMMY
---- else  

Conditions
Each rule may have multiple rewrite rules. Which one is used depends on a {condition}? expression. In general, a rule has the form:
ruleName: parser expression 
  -> {condition1}? rewrite rule 1 
  -> {condition2}? rewrite rule 2 
  -> ...
  -> {conditionN}? rewrite rule n 
  -> fallback rewrite rule

A condition is any valid expression in whatever target language is used. You can reference labels using the usual $label syntax.

For example, we can enhance 'flat tree from at least two expressions' into 'flat tree from any number of expressions (including one)'. The binaryAnd rule will match any number of expressions. The first expression is labeled b, the rest go into the list a:
//flat tree from any number of expressions (including one)
binaryAnd: b=expression (AND a+=expression)* 

If there are no additional arguments, i.e. the list a is empty, binaryAnd is equivalent to its first argument b:
//if the list $a is empty, use only first argument 
  -> {$a==null || $a.isEmpty()}? $b

If it is not empty, create usual flat tree:
//otherwise create flat tree from all arguments
  -> ^(AND $b $a*);

This is the final rule:
//flat tree from any number of expressions (including one)
binaryAnd: b=expression (AND a+=expression)* 
  //if the list $a is empty, use only first argument 
  -> {$a==null || $a.isEmpty()}? $b
  //otherwise create flat tree from all arguments
  -> ^(AND $b $a*);

Abstract syntax tree generated from single expression bwahaha is:
bwahaha

and abstract syntax tree generated from bwahaha && winner && looser is:
AND
-- bwahaha
-- winner
-- looser
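A flat n-ary tree like this is convenient for an interpreter: evaluating an AND node is a single pass over its children. The sketch below is a hypothetical illustration using a minimal Node class of our own instead of ANTLR's CommonTree, with boolean literals at the leaves; it shows the tree-walking idea, not code generated from the grammar.

```java
import java.util.Arrays;
import java.util.List;

public class AstEval {
    // minimal stand-in for ANTLR's CommonTree
    static class Node {
        final String text;        // "AND", "OR", "NOT", "true", "false"
        final List<Node> children;
        Node(String text, Node... children) {
            this.text = text;
            this.children = Arrays.asList(children);
        }
    }

    static boolean eval(Node n) {
        switch (n.text) {
            case "AND": // flat n-ary node: every child must hold
                return n.children.stream().allMatch(AstEval::eval);
            case "OR":  // flat n-ary node: at least one child must hold
                return n.children.stream().anyMatch(AstEval::eval);
            case "NOT":
                return !eval(n.children.get(0));
            default:    // leaf: a boolean literal
                return Boolean.parseBoolean(n.text);
        }
    }

    public static void main(String[] args) {
        // flat tree for: true && true && false
        Node tree = new Node("AND",
                new Node("true"), new Node("true"), new Node("false"));
        System.out.println(eval(tree)); // prints false
    }
}
```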


Multi-Parameter Functions With Rewrite Rules

Finally, we are ready to add some reasonable structure to our multi-parameter functions grammar. Recall that our grammar has four functions:
  • all(p1, p2, ..., pn) equivalent to p1 && p2 && ... && pn,
  • atleastone(p1, p2, ..., pn) equivalent to p1 || p2 || ... || pn,
  • neither(p1, p2, ..., pn) equivalent to !p1 && !p2 && ... && !pn,
  • notall(p1, p2, ..., pn) equivalent to !p1 || !p2 || ... || !pn

and their abstract syntax trees should hide the existence of all four new keywords.

First, the original general function:
//original functions definition with complicated tree
function : functionname^ arguments;
functionname : ALL | ATLEASTONE | NEITHER | NOTALL; 
arguments : LPAREN! andexpression (','! andexpression)* RPAREN!;

is split into multiple concrete functions:
//function definition
function : allFunction | atleastoneFunction | neitherFunction | notallFunction;
//concrete functions
allFunction : ...;
atleastoneFunction : ...;
neitherFunction : ...;
notallFunction : ...;

The function all may have either one or multiple arguments. We use rewrite rule conditions to generate a different tree in each case:
//the function all(p1, p2, ..., pn) is equivalent to p1 && p2 && ... && pn
allFunction : ALL LPAREN b=andexpression (',' a+=andexpression)* RPAREN 
  //if the list $a is empty, use only first argument 
  -> {$a==null || $a.isEmpty()}? $b
  //otherwise create a flat tree 
  -> ^(AND $b $a*);

Abstract syntax tree generated from all(bwahaha) is:
bwahaha

and abstract syntax tree generated from all(bwahaha, winner, looser) is:
AND
-- bwahaha
-- winner
-- looser

The function atleastone is almost the same as the previous all function. Simply replace the && operator with the || operator:
//the function atleastone(p1, p2, ..., pn) is equivalent to p1 || p2 || ... || pn
atleastoneFunction : ATLEASTONE LPAREN b=andexpression (',' a+=andexpression)* RPAREN 
  //if the list $a is empty, use only first argument 
  -> {$a==null || $a.isEmpty()}? $b
  //otherwise create a flat tree 
  -> ^(OR $b $a*);

The function neither copies the all function, except that each argument is wrapped in a NOT node:
//the function neither(p1, p2, ..., pn) is equivalent to !p1 && !p2 && ... && !pn
neitherFunction : NEITHER LPAREN b=andexpression (',' a+=andexpression)* RPAREN 
  //if the list $a is empty, use only first argument 
  -> {$a==null || $a.isEmpty()}? ^(NOT $b)
  //otherwise create a flat tree 
  -> ^(AND ^(NOT $b) ^(NOT $a)*);

Abstract syntax tree generated from neither(bwahaha) is:
NOT
-- bwahaha

and abstract syntax tree generated from neither(bwahaha, winner, looser) is:
AND
-- NOT
---- bwahaha
-- NOT
---- winner
-- NOT
---- looser

Finally, the function notall copies the atleastone function. Again, each argument is wrapped in a NOT node:
//the function notall(p1, p2, ..., pn) is equivalent to !p1 || !p2 || ... || !pn
notallFunction : NOTALL LPAREN b=andexpression (',' a+=andexpression)* RPAREN 
  //if the list $a is empty, use only first argument 
  -> {$a==null || $a.isEmpty()}? ^(NOT $b)
  //otherwise create a flat tree 
  -> ^(OR ^(NOT $b) ^(NOT $a)*);

Abstract syntax tree generated from notall(bwahaha) is:
NOT
-- bwahaha

and abstract syntax tree generated from notall(bwahaha, winner, looser) is:
OR
-- NOT
---- bwahaha
-- NOT
---- winner
-- NOT
---- looser
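The equivalence claimed for notall, !p1 || !p2 || ... || !pn, is the De Morgan dual of !(p1 && p2 && ... && pn). The standalone snippet below, an illustration independent of the grammar, brute-force checks that equivalence for all combinations of three booleans.

```java
public class DeMorganCheck {
    public static void main(String[] args) {
        // notall(p1, p2, p3) rewrites to !p1 || !p2 || !p3,
        // which De Morgan's law says equals !(p1 && p2 && p3)
        for (int mask = 0; mask < 8; mask++) {
            boolean p1 = (mask & 1) != 0;
            boolean p2 = (mask & 2) != 0;
            boolean p3 = (mask & 4) != 0;
            boolean notall = !p1 || !p2 || !p3;
            if (notall != !(p1 && p2 && p3)) {
                throw new AssertionError("mismatch at mask " + mask);
            }
        }
        System.out.println("equivalence holds for all 8 combinations");
    }
}
```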

As usual, the complete grammar, including the compiler and a test case, is available on Github under the 007-S008FunctionsNiceTree tag.

End

Depending on the language you are trying to create, parser construction may be very easy or very difficult. As the topic is very broad, we covered only the basics needed to build an abstract syntax tree for an expression in a simple expression language.

A little knowledge of ANTLR may turn some time-consuming tasks into quick and easy exercises. We hope this post gave you a basic idea of what is possible and how to achieve it.

11 comments:

Anonymous said...

This was very informative. But how do you get around the "greedy" token matching in a grammar like:

grammar Option ;

//
// any number of optionstacks
//

option : OPTIONSTACK_GROUP + ;

//
// At least one option_line per stack
//

OPTIONSTACK_GROUP : OPTIONSTACK_LINE
OPTION_LINE + ;

OPTIONSTACK_LINE : WS 'OPTIONSTACK' ':'
optionstackname=STRING { System.out.println ("optionstackname:" + optionstackname); }
WS;

OPTION_LINE : WS 'OPTION' ':'
optionname=STRING { System.out.println ("optionname:" + optionname); }
':' optionvalue=STRING { System.out.println ("actionvalue:" + optionvalue); }
WS ;

//
// any number (including zero) of spaces or tabs and returns
//

WS : ( '\t' | ' ' ) * ( ( '\r' ? '\n') * | '\uffff' ) ;

//
// we need the slash for filenames eventually
//

STRING : ( 'a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '.' | '_' | '/' ) * ;

======
which fails on the second OPTIONSTACK in this data file:
======
OPTIONSTACK:PARSER
OPTION:TRACE:1
OPTION:VERBOSE:1

OPTIONSTACK:ANALYZER
OPTION:TRACE:1
OPTION:VERBOSE:1

OPTIONSTACK:OPTIMIZER
OPTION:TRACE:1
OPTION:VERBOSE:1

OPTIONSTACK:GENERATOR
OPTION:TRACE:1
OPTION:VERBOSE:1

Meri said...

Hi Anonymous,

nice exercise you got :). That grammar file has three problems:
1.) You are making whitespace mandatory; I would rather ignore it.

2.) A lexer rule may reference only fragment lexer rules. OPTIONSTACK_GROUP in that grammar references the OPTIONSTACK_LINE and OPTION_LINE lexer rules, which are not fragments.

3.) Each lexer rule corresponds to one token. As a result, even if the grammar worked, it would create one big token from:
OPTIONSTACK:PARSER
OPTION:TRACE:1
OPTION:VERBOSE:1

another big token from
OPTIONSTACK:ANALYZER
OPTION:TRACE:1
OPTION:VERBOSE:1

and so on. Basically, the parser would have nothing to do. You are trying to put all the functionality into the lexer, and it is not built for that.

The solution to this is simple: change OPTIONSTACK_GROUP, OPTIONSTACK_LINE and OPTION_LINE lexer rules into optionstack_group, optionstack_line, option_line parser rules.

Hope that helped. The corrected grammar is in the next comment.

Meri said...

grammar Option ;

//
// any number of optionstacks
//

option : optionstack_group + ;

//
// At least one option_line per stack
//

optionstack_group : optionstack_line
option_line + ;

optionstack_line : 'OPTIONSTACK' ':'
optionstackname=STRING { System.out.println ("optionstackname:" + optionstackname); };

option_line : 'OPTION' ':'
optionname=STRING { System.out.println ("optionname:" + optionname); }
':' optionvalue=STRING { System.out.println ("actionvalue:" + optionvalue); }
;

//
// any number (including zero) of spaces or tabs and returns
//

WS : ( '\t' | ' ' ) * ( ( '\r' ? '\n') * | '\uffff' ) {$channel=HIDDEN;} ;

//
// we need the slash for filenames eventually
//

STRING : ( 'a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '.' | '_' | '/' ) * ;

*****************************************************
OUTPUT ON OPTIONS FILE
*****************************************************
optionstackname:[@2,12:17='PARSER',<4>,1:12]
optionname:[@6,26:30='TRACE',<4>,2:7]
actionvalue:[@8,32:32='1',<4>,2:13]
optionname:[@12,41:47='VERBOSE',<4>,3:7]
actionvalue:[@14,49:49='1',<4>,3:15]
optionstackname:[@18,64:71='ANALYZER',<4>,5:12]
optionname:[@22,80:84='TRACE',<4>,6:7]
actionvalue:[@24,86:86='1',<4>,6:13]
optionname:[@28,95:101='VERBOSE',<4>,7:7]
actionvalue:[@30,103:103='1',<4>,7:15]
optionstackname:[@34,118:126='OPTIMIZER',<4>,9:12]
optionname:[@38,135:139='TRACE',<4>,10:7]
actionvalue:[@40,141:141='1',<4>,10:13]
optionname:[@44,150:156='VERBOSE',<4>,11:7]
actionvalue:[@46,158:158='1',<4>,11:15]
optionstackname:[@50,173:181='GENERATOR',<4>,13:12]
optionname:[@54,190:194='TRACE',<4>,14:7]
actionvalue:[@56,196:196='1',<4>,14:13]
optionname:[@60,205:211='VERBOSE',<4>,15:7]
actionvalue:[@62,213:213='1',<4>,15:15]

Victor said...

Very good article, thank you for this very valuable information.

Cliff said...

Thank you for posting this, specifically the conditional re-write rules, it was exactly what I was searching for.

Anonymous said...

Thanks for the 'ANTLR lexer uses standard regular expressions grammar.'
Just what the doctor ordered!

Meri said...

@Anonymous Happy to hear that :).

Anonymous said...

This is really nice

Antonio Guilherme said...

Thank you very much for this blog post. It's much better than the official documentation and examples.

Anonymous said...

Very informative, very concise, very well explained. Thank you for sharing in this post

Post a Comment