Engineering a compiler through examples: building a mathematical expression engine - Part 5
In this post, we will continue developing our compiler by implementing its second component: the syntactic analyzer. Along the way, we’ll explore the tools commonly used for this purpose (such as grammars).
As we saw in the previous post, the lexical analysis results in a stream of tokens. The next step is to ensure that this sequence is syntactically correct—that is, it follows a set of predefined rules. If it does, we can then construct the corresponding tree data structure that represents the mathematical expression.
It is the role of the syntactic analyzer (also known as the parser) to carry out this step.
How to know if a sequence of tokens is syntactically correct ?
Here, we’ll draw on a simple and intuitive analogy to understand the concept: how do we determine whether a sentence in our native language is grammatically correct? Let’s consider the following example.
I think that Sweden is one of the most beautiful country in the world.
Is this sentence grammatically (i.e., syntactically) correct? Yes, it is. It begins with a subject ("I"), followed by a verb ("think"), which is then followed by a subordinate clause containing its own subject ("Sweden"), a verb ("is"), and an adjective ("beautiful"). While this process might seem a bit involved, our brains automatically perform all these checks—drawing on what we learned as children—and conclude that the sentence is well-formed. We’re so accustomed to our native language that we’re often unaware of all this underlying mental processing.
Sweden think I that is of one the world in country beautiful most the.
This sentence now makes absolutely no sense. Admittedly, it is written in English (and a lexical analyzer would flag no errors), but the words are clearly thrown together in a random order, resulting in gibberish. Our brain instantly picks up on this because, precisely, we’re used to sentences following a familiar structure: a subject, followed by a verb, then a complement. Since that pattern is broken here, we naturally conclude that the sentence is not syntactically correct.
Since childhood, we've been accustomed to following a set of rules that everyone agrees upon and that we learn in school. In fact, when we speak, we're expected to follow grammatical rules—otherwise, we risk not being understood. The same principle applies to programming languages: we must adhere to a predefined set of rules.
This set of rules is what we refer to as a grammar.
The parser obtains a string of tokens from the lexical analyzer and verifies that the string of token names can be generated by the grammar of the source language.
Compilers: Principles, Techniques, and Tools (Aho, Ullman, Sethi, Lam)
How to formally define a grammar ?
Our goal is to define a grammar (a set of rules) that our mathematical expressions must follow in order to be considered valid. We will therefore provide the formal definition of what a grammar is and illustrate it later with concrete examples.
A grammar is a formal set of rules that specifies how sequences of tokens (produced by the lexical analyzer) can be combined to form meaningful constructs (such as expressions, statements, functions).
More precisely, a grammar is defined as a quadruple $(T, V, P, S)$, where:
$T$ denotes the set of terminal symbols: they are the basic symbols from which strings are formed (usually the tokens like numbers, variables, operators, functions).
$V$ denotes the set of non-terminal symbols: they are abstract symbols representing patterns or groupings (like Expression, Term, Factor, etc...).
$P$ denotes the set of production rules: they are rules that describe how non-terminal symbols can be replaced by combinations of terminals and/or other non-terminals.
$S$ denotes the start symbol: it is the non-terminal symbol that represents a complete structure (for example Expression).
This definition is fairly straightforward, but it can feel a bit abstract at first glance. To make it more concrete, let’s look at a simple example where we define a grammar for a basic arithmetic expression language.
1(1) list ::= list + digit
2(2) list ::= list - digit
3(3) list ::= digit
4(4) digit ::= 0|1|2|3|4|5|6|7|8|9
This grammar is capable of representing expressions such as $8 - 4 + 6$ or $8 - 4$. For this grammar:
- $T = +, -, 0, 1, , 2, 3, 4, 5, 6, 7, 8, 9$
- $V = list, digit$
- $P$ is the set of productions listed below.
- $S = list$
A grammar derives strings by beginning with the start symbol and repeatedly replacing a nonterminal by the body of a production for that nonterminal.
Compilers: Principles, Techniques, and Tools (Aho, Ullman, Sethi, Lam)
For instance, we can determine whether the expression $8 - 4 + 6$ can be derived from the previously defined grammar by applying the following sequence of rules.
- We begin with the start symbol (which in this case is list) and with the first token (which is $8$).
- Since $8$ is a digit, we can apply rule (3) (list ::= digit) and determine that $8$ is a list.
- Since $8$ is a list and $4$ is a digit, we can apply rule (2) (list ::= list - digit) and determine that $8-4$ is a list.
- Since $8-4$ is a list and $6$ is a digit, we can apply rule (1) (list ::= list + digit) and determine that $8-4+6$ is a list.
Using this formalism, we will now define a grammar for our mathematical expressions.
Defining a grammar for mathematical expressions
The following grammar can be used to meet our requirements. This grammar accounts for the unary minus operator as well as the exponentiation of an expression.
1Expression := Term (('+' | '-') Term)*
2Term := Power (('*' | '/') Power)*
3Power := Unary ('^' Power)?
4Unary := ('-' Unary) | Primary
5Primary := NUMBER | VARIABLE | FUNCTION '(' Expression ')' | '(' Expression ')'
6VARIABLE := [a-zA-Z][a-zA-Z0-9_]*
7FUNCTION := [cos|sin|exp|ln|sqrt]'(' Expression ')'
8NUMBER := [0-9]+
At first glance, this grammar seems natural and quite intuitive. While that's true, it is actually not so easy to refine. In fact, it is specifically designed to handle the precedence of * over +, to be unambiguous, and to avoid being left-recursive (see infra for more details).
Designing a grammar, even one as simple as this, is not a trivial task. While we won’t dive into the intricate details, there are some crucial requirements we must meet:
The grammar must be unambiguous, meaning there should be only one possible way to derive a string from the grammar. For example, in our case, $y + 2 * x$ must not be interpreted as $(y + 2) * x$ (since * takes precedence over +). A naive grammar might not account for this properly.
Similarly, the grammar must avoid left-recursivity. Left-recursivity refers to a situation in a grammar where a non-terminal symbol appears as the first symbol on the right-hand side of one of its own production rules. This creates a recursive loop that can lead to infinite recursion when trying to parse an input.
For more information, refer to Compilers: Principles, Techniques, and Tools (Aho, Ullman, Sethi, Lam) or Engineering a Compiler (Cooper, Torczon).
How does one perform syntactic analysis ?
It’s now time to write a concrete parser for our grammar. In fact, there are several methods for parsing, but here we will use the simplest one, derived from top-down parsing.
Top-down parsing is a method of parsing where the process starts from the start symbol of the grammar and works downwards, recursively applying production rules to break the input into smaller and smaller components until the input tokens are matched.
The parsing process begins at the top of the syntax tree, typically with the start symbol of the grammar (in our scenario, Expression).
The parser recursively applies production rules to decompose the start symbol into smaller components (non-terminals or terminals). If the parser makes a wrong prediction (that is, chooses the wrong production rule), it may have to backtrack and try another rule (this operation can be costly).
It continues breaking down the structure until the terminal symbols (tokens) are matched with the input.
An example, please !
Let’s consider a very simple grammar (where the start symbol is Expression).
1Expression ::= Term | Term + Expression
2Term ::= Number
3Number ::= [0-9]+
How would a parser go about parsing the simple string $3+4$ ?
We start with Expression. At this point, we can see that there are two possible production rules (Expression ::= Term and Expression := Term + Expression).
For example, it would try to match the rule Expression ::= Term, but it would quickly fail to parse the "+" operator and would therefore be forced to backtrack—meaning it would return to a previous point and attempt another production rule.
It would then try to match the rule Expression ::= Term + Expression, then parse Term and look for the + symbol. Once the first Term (which is $3$ in this case) is matched, it moves to parse the remaining Expression, which consists of the next Term (the $4$ in this case).
Even in this incredibly simple example, we can see that the parser had to try several rules before successfully finding the correct one. This backtracking process can significantly impact performance and should be avoided whenever possible. Parsers that can predict the next rule based solely on the next token are called predictive parsers, and they are highly sought after for their efficiency.
Our grammar for mathematical expressions can be parsed using a predictive parser, which will simplify the code, as we will see.
The other important technique for parsing is bottom-up parsing. In short, this approach works in the opposite direction, starting from the input tokens and gradually building up to the start symbol by applying production rules in reverse. It constructs the parse tree from the leaves (tokens) upwards, ultimately reaching the root (the start symbol).
Coding our parser
Our grammar can be parsed using a top-down parser, and more specifically, a predictive top-down parser (recall that not all top-down parsers are predictive—we previously saw an example where backtracking was required). It turns out that predictive parsers lend themselves well to a simple recursive implementation. From a broader perspective, the code is as follows (taken from the Dragon Book).
1/* This code must be written for each nonterminal symbol*/
2void A()
3{
4 choose a production rule A ::= X1X2...Xk
5 for(i = 1 to k)
6 {
7 if (Xi is a nonterminal) call Xi();
8 else if (Xi equals the current symbol a) advance the input to the next symbol;
9 else /* An error has occurred.*/
10 }
11}
A recursive-descent parsing program consists of a set of procedures, one for each non terminal.
Compilers: Principles, Techniques, and Tools (Aho, Ullman, Sethi, Lam)
Below is the concrete implementation of this algorithm in our dedicated scenario. We begin by defining an interface for the parser, just as we did for the tokenizer.
1public interface IParser
2{
3 Expression Parse();
4}
The interface contains a single method, whose purpose is to return an expression parsed from the input string. A concrete implementation of this interface is provided below.
1public class Parser : IParser
2{
3 private ITokenizer _tokenizer;
4 private Token _currentToken;
5
6 public Parser(ITokenizer tokenizer)
7 {
8 _tokenizer = tokenizer;
9 }
10
11 public Expression Parse()
12 {
13 GetNextToken();
14 return E();
15 }
16
17 private Expression E()
18 {
19 var list = new List<Expression> { T() };
20 while (_currentToken == ("+", TokenType.OPERATOR) || _currentToken == ("-", TokenType.OPERATOR))
21 {
22 if (_currentToken.Lexeme == "-")
23 {
24 GetNextToken();
25 list.Add(new MinusUnaryExpression(T()));
26 }
27
28 if (_currentToken.Lexeme == "+")
29 {
30 GetNextToken();
31 list.Add(T());
32 }
33 }
34
35 return list.Count == 1 ? list[0] : new SumExpression(list.ToArray());
36 }
37
38 // Term → Factor (('*' | '/') Factor)*
39 private Expression T()
40 {
41 var list = new List<Expression> { P() };
42 while (_currentToken == ("*", TokenType.OPERATOR) || _currentToken == ("/", TokenType.OPERATOR))
43 {
44 if (_currentToken.Lexeme == "/")
45 {
46 GetNextToken();
47 list.Add(new InverseExpression(P()));
48 }
49
50 if (_currentToken.Lexeme == "*")
51 {
52 GetNextToken();
53 list.Add(P());
54 }
55 }
56
57 return list.Count == 1 ? list[0] : new ProductExpression(list.ToArray());
58 }
59
60 private Expression P()
61 {
62 var p = U();
63 if (_currentToken == ("^", TokenType.OPERATOR))
64 {
65 GetNextToken();
66 var r = new PowerExpression(p, Convert.ToDouble(_currentToken.Lexeme));
67 GetNextToken();
68 return r;
69 }
70
71 return p;
72 }
73
74 // Unary → ('-' Unary) | Primary
75 private Expression U()
76 {
77 if (_currentToken == ("-", TokenType.OPERATOR))
78 {
79 GetNextToken();
80 return new MinusUnaryExpression(U()); // Apply unary minus to the next expression
81 }
82 else
83 {
84 return Primary();
85 }
86 }
87
88 // Primary → NUMBER | VARIABLE | FUNCTION '(' Expression ')' | '(' Expression ')'
89 private Expression Primary()
90 {
91 if (_currentToken.Type == TokenType.CONSTANT)
92 {
93 var r = new ConstantExpression(double.Parse(_currentToken.Lexeme));
94 GetNextToken();
95 return r;
96 }
97 if (_currentToken.Type == TokenType.VARIABLE)
98 {
99 var r = new VariableExpression(_currentToken.Lexeme);
100 GetNextToken();
101 return r;
102 }
103
104 if (_currentToken.Type == TokenType.FUNCTION)
105 {
106 var fname = _currentToken.Lexeme;
107 GetNextToken();
108 if (_currentToken == ("(", TokenType.PARENTHESIS))
109 {
110 GetNextToken();
111 var e = E();
112 if (_currentToken == (")", TokenType.PARENTHESIS))
113 {
114 GetNextToken();
115 switch (fname)
116 {
117 case "sin":
118 return new SinusExpression(e);
119 case "cos":
120 return new CosinusExpression(e);
121 case "ln":
122 return new LnExpression(e);
123 case "exp":
124 return new ExpExpression(e);
125 case "sqrt":
126 return new SqrtExpression(e);
127 default:
128 throw new Exception($"A syntax error has occurred.");
129 }
130 }
131 else
132 throw new Exception($"A syntax error has occurred.");
133 }
134
135 }
136
137 if (_currentToken == ("(", TokenType.PARENTHESIS))
138 {
139 GetNextToken();
140 var e = E();
141 if (_currentToken == (")", TokenType.PARENTHESIS))
142 {
143 GetNextToken();
144 return e;
145 }
146 else
147 throw new Exception($"A syntax error has occurred.");
148 }
149
150 throw new Exception($"Unexpected character '{_currentToken.Lexeme}'");
151 }
152
153 private void GetNextToken()
154 {
155 _currentToken = _tokenizer.GetNextToken();
156 }
157}
The parser takes an instance of the lexical analyzer as an argument and recursively applies the rules until it constructs the corresponding expression tree or fails.
Running the prgram
It's now time to see our parser in action.
To sum up, we now have a parser that allows us to express complex mathematical expressions in a user-friendly and intuitive way, translating them into a tree structure on which operations can be easily performed. In the next post, we’ll see how to bring everything together by applying it to a simple optimization problem.
Engineering a compiler through examples: building a mathematical expression engine - Part 6