Engineering a compiler through examples: building a mathematical expression engine - Part 4

In this post, we will begin building our compiler by developing the first component: the lexical analyzer. Along the way, we’ll explore how tools like finite automata can prove to be extremely useful.

In the previous post, we explored how to model mathematical expressions using a tree data structure. While these structures proved quite powerful for performing various operations, we also discovered that they are cumbersome, error-prone, and time-consuming to build by hand. Simply recall how even a straightforward expression like $2x + cos(y)$ becomes surprisingly verbose to represent.

var f = new SumExpression(new ProductExpression(new ConstantExpression(2), new VariableExpression("x")), new CosinusExpression(new VariableExpression("y")));

What is our goal?

The goal is to implement a Function class whose constructor accepts a simple string and internally constructs the corresponding expression using the previously described model.

This way, we get the best of both worlds: the user can input expressions using natural mathematical conventions, while internally we benefit from the power and flexibility of the tree-based representation.
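In other words, we would like to be able to simply write something like the following instead (a quick sketch of the intended usage; the Function class itself is only outlined at the end of this post and completed in the next one).

var f = new Function("2*x + cos(y)");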

Writing a lexical analyzer

The first step in achieving this is to ensure that the user’s input expression is lexically correct. As we've already discussed, this is precisely the role of a lexical analyzer.

How does one perform lexical analysis?

The main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes and produce as output a sequence of tokens for each lexeme in the source program.
Compilers: Principles, Techniques, and Tools (Aho, Lam, Sethi, Ullman)

This definition, taken from the renowned Dragon Book, introduces some abstract concepts such as lexemes and tokens. Let’s explore these right away.

  • Each individual component of the input string, typically delimited by whitespace or punctuation, is commonly referred to as a lexeme. In the expression above, for instance, '2', 'x', and 'cos' are distinct lexemes.

  • Each lexeme corresponds to a lexical category in the language in question, such as being a variable, operator, or function. To illustrate, 'cos' is a function, 'x' is a variable, '+' represents an operator, and so forth.

  • A token is simply the amalgamation of these two pieces of information: the current lexeme and its associated category. Consequently, examples of tokens would be (FUNCTION | cos) or (VARIABLE | x).

A lexical analyzer thus initiates its process with a string, representing the raw source code written by a programmer (or, in our scenario, a mathematical expression written by a human). The goal is to determine the sequence of corresponding tokens, as illustrated in the image below.
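For the input "2*x + cos(y)", for instance, we would expect the lexical analyzer to produce the sequence (CONSTANT | 2), (OPERATOR | *), (VARIABLE | x), (OPERATOR | +), (FUNCTION | cos), (PARENTHESIS | (), (VARIABLE | y), (PARENTHESIS | )).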

Modeling lexemes and tokens

The previous definitions lead us to the following code (a category is represented by the TokenType enum).

public enum TokenType { CONSTANT, VARIABLE, OPERATOR, FUNCTION, PARENTHESIS, EOF }

public class Token
{
    public string Lexeme { get; }

    public TokenType Type { get; }

    public Token(string lexeme, TokenType type)
    {
        Lexeme = lexeme;
        Type = type;
    }

    public override string ToString()
    {
        return $"({Type}|{Lexeme})";
    }
}

Defining the lexical analyzer

We will now define a general interface for a lexical analyzer (also referred to as a tokenizer in the literature) as follows.

public interface ITokenizer
{
    Token GetNextToken();
}
Information

This interface contains a single method, which is responsible for returning the next token from a given input string. As we’ll see later, the tokenizer is used by the syntactic parser to retrieve tokens one at a time (the parser will frequently need to know what the next token is).
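For example, once the Tokenizer built throughout this post is complete, the whole input could be consumed with a simple loop such as the one below (an illustrative sketch only, assuming the Token and Tokenizer types defined in this post).

var tokenizer = new Tokenizer(new StringReader("2*x + cos(y)"));
var token = tokenizer.GetNextToken();
while (token.Type != TokenType.EOF)
{
    Console.WriteLine(token); // e.g. (CONSTANT|2), (OPERATOR|*), (VARIABLE|x), ...
    token = tokenizer.GetNextToken();
}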

An implementation of this interface is provided below.

public class Tokenizer : ITokenizer
{
    private TextReader _reader;
    private char _currentChar;
    private Token _currentToken;

    public Token Token => _currentToken;

    public Tokenizer(TextReader reader)
    {
        _reader = reader;

        GetNextChar();
    }

    public Token GetNextToken()
    {
        // Skip whitespace
        while (char.IsWhiteSpace(_currentChar))
            GetNextChar();

        if (_currentChar == '\0') return new Token("", TokenType.EOF);

        // code to write

        throw new InvalidDataException($"Unexpected character: {_currentChar}");
    }

    private void GetNextChar()
    {
        var ch = _reader.Read();
        _currentChar = ch < 0 ? '\0' : (char)ch;
    }
}

This code is fairly straightforward: we read the input one character at a time. Whitespace characters are skipped, and when we reach the end of the input (signaled by the '\0' sentinel returned by GetNextChar), we simply return an EOF token to indicate it.

Detecting operators and parentheses

These tokens are quite easy to identify, and the following code reflects a conventional approach commonly found in lexical analyzers.

public Token GetNextToken()
{
    // Skip whitespace
    while (char.IsWhiteSpace(_currentChar))
        GetNextChar();

    if (_currentChar == '\0') return new Token("", TokenType.EOF);

    // Detect operators and parentheses
    switch (_currentChar)
    {
        case '+':
            GetNextChar();
            return new Token("+", TokenType.OPERATOR);
        case '-':
            GetNextChar();
            return new Token("-", TokenType.OPERATOR);
        case '*':
            GetNextChar();
            return new Token("*", TokenType.OPERATOR);
        case '/':
            GetNextChar();
            return new Token("/", TokenType.OPERATOR);
        case '^':
            GetNextChar();
            return new Token("^", TokenType.OPERATOR);
        case '(':
            GetNextChar();
            return new Token("(", TokenType.PARENTHESIS);
        case ')':
            GetNextChar();
            return new Token(")", TokenType.PARENTHESIS);
        default:
            break;
    }

    throw new InvalidDataException($"Unexpected character: {_currentChar}");
}

It's essentially a simple mapping between the current character and its corresponding token.

Detecting numbers

Numbers are a bit more complex to identify, as they consist of multiple characters rather than just one (for instance, 3.14 requires four characters). Therefore, we need to buffer the characters we've read so far and continue looping until we're certain we've captured the entire numeric value.

public Token GetNextToken()
{
    // Skip whitespace
    while (char.IsWhiteSpace(_currentChar))
        GetNextChar();

    if (_currentChar == '\0') return new Token("", TokenType.EOF);

    // Detect operators and parentheses
    // omitted for brevity

    // Detect numbers
    if (char.IsDigit(_currentChar) || _currentChar == '.')
    {
        var sb = new StringBuilder();
        var haveDecimalPoint = false;
        while (char.IsDigit(_currentChar) || (!haveDecimalPoint && _currentChar == '.'))
        {
            sb.Append(_currentChar);
            haveDecimalPoint = haveDecimalPoint || _currentChar == '.';
            GetNextChar();
        }

        var number = double.Parse(sb.ToString(), CultureInfo.InvariantCulture);
        return new Token(number.ToString(CultureInfo.InvariantCulture), TokenType.CONSTANT);
    }

    throw new InvalidDataException($"Unexpected character: {_currentChar}");
}

Detecting functions and variables

It would seem relatively straightforward at first glance to detect built-in functions (like cos or exp), given that there is only a finite number of them in each language. The code snippet below, for instance, could accomplish this task.

var keywords = new List<string> { "cos", "sin", "exp" };
var input = "2*x + cos(y)";
var lexemes = input.Split(new[] { " ", "+", "*", "(", ")" }, StringSplitOptions.RemoveEmptyEntries);
var tokens = new List<Token>();
foreach (var lexeme in lexemes)
{
    if (keywords.Contains(lexeme))
    {
        tokens.Add(new Token(lexeme, TokenType.FUNCTION));
    }
}

However, challenges arise when it comes to detecting variables. As variables can essentially be any sequence of characters (like $x$ or $foo1234$), it is impractical to predefine all possible user-created variables, unlike the approach used in the code above. This is where a specialized data structure comes into play – the finite automaton.

What are finite automata?

A finite automaton is a computational model with a finite set of states, transitions between these states, and an input alphabet that drives state transitions. It operates in discrete steps, processing input symbols to move between states based on predefined rules.

  • States are a finite set of distinct conditions or situations that the automaton can be in at any given time.

  • Transitions are rules or functions that define how the automaton transitions from one state to another in response to input symbols.

  • The input alphabet is the set of symbols that the automaton reads as input during its operation.

  • The initial state is the starting point of the automaton's operation, indicating its initial condition.

  • The accepting states are the states that, when reached, signify the successful recognition or acceptance of an input sequence.
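Putting these components together, a finite automaton is usually written formally as a 5-tuple $A = (Q, \Sigma, \delta, q_0, F)$, where $Q$ is the finite set of states, $\Sigma$ is the input alphabet, $\delta : Q \times \Sigma \rightarrow Q$ is the transition function, $q_0 \in Q$ is the initial state and $F \subseteq Q$ is the set of accepting states.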

Information

The theory underpinning finite automata is highly intricate and fascinating. While we won't explore profound theoretical concepts here, such as deterministic or nondeterministic automata or the remarkable Kleene's theorem, readers intrigued by these subjects can refer to Introduction to Automata Theory, Languages, and Computation (Hopcroft, Motwani, Ullman) for more comprehensive insights.

The above definition is quite formal, and readers who are not well-versed in these structures may not find it particularly illuminating. Consequently, we will take a graphical approach to make this incredibly useful structure tangible.

A simple worked example

Imagine a situation where we aim to design an automaton for identifying the word 'cos'. The subsequent image illustrates the steps to take in this process.
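In short, the automaton in question has four states: from state 1 (the initial state), a 'c'-labeled transition leads to state 2, an 'o'-labeled transition leads from state 2 to state 3, and an 's'-labeled transition leads from state 3 to state 4, the only accepting state.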

  • We start from state 1 (depicted in red in the figure above) with the string "cos". Since the string begins with a 'c', we look for a 'c'-labeled transition out of state 1. Such a transition exists, so we move to state 2.
  • Next, on the character 'o', we repeat the process and follow the 'o'-labeled transition from state 2 to state 3.
  • This continues until we reach state 4. In our example, state 4 is highlighted in green, indicating an accepting state: we can stop here and conclude that the recognized string is precisely "cos".

Can this structure recognize the word 'exp'?

Once more, we start from state 1 (our initial state), this time with the string "exp". Since the string begins with an 'e', we look for an 'e'-labeled transition out of state 1. No such transition exists, so we are stuck, and it becomes evident that this structure does not recognize the word 'exp'.

To rectify this, it is necessary to enhance our structure by introducing the requisite states.

Can this structure recognize the word 'co'?

  • Once again, we start from state 1 with the string "co". As the string begins with a 'c', we follow the 'c'-labeled transition and proceed to state 2.
  • Continuing with 'o', we move to state 3, where the string is exhausted. Since state 3 is not an accepting state, we deduce that this structure does not recognize the word 'co'.

And so?

The structures described above are finite automata:

  • States are represented by the set of all circles (1 through 4 in our case).
  • The transitions are labeled with individual characters ('c', 'o', and 's' in our case); moving from one state to another is only possible if there exists a transition labeled with the current character.
  • The input alphabet is the set of alphanumeric characters.
  • An initial state is designated in red.
  • An accepting state is identified in green.
Important

An automaton recognizes a string if and only if, starting from the initial state, it can traverse to one of the accepting states by transitioning through the constituent characters of the string.
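In symbols, a string $w$ is recognized if and only if $\hat{\delta}(q_0, w) \in F$, where $\hat{\delta}$ denotes the transition function extended from single characters to whole strings.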

On its own, the current data structure serves little practical purpose, as identifying the word 'cos' can be accomplished without such intricate machinery. However, envision how this structure can be employed to recognize variables.

Recognizing variables

For our illustration, let's suppose that variables consist solely of sequences of alphabetic characters, with numeric characters not permitted. For instance, "foo" is considered a variable, while "x03" is not, as it contains non-alphabetic characters (0 and 3). Additionally, we impose the requirement that a variable must consist of at least one character.

With this stipulation, the following automaton has the capability to identify a variable.
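Concretely, two states suffice: from the initial state 1, every letter leads to state 2, and state 2 (the accepting state) loops back to itself on every letter.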

We start from state 1 with the string "foo". Since the string begins with an 'f', we follow the 'f'-labeled transition to state 2. Continuing through the process, we can consume all the remaining characters and end in an accepting state. We conclude that 'foo' is recognized as a variable.

Important 1

We can observe that there is a transition from a state to itself in this case. Such transitions are permitted by the definition of an automaton.

Important 2

It is quite noteworthy how the final structure is straightforward in comparison to the task it is designed to perform. With just two states and a handful of transitions, we can now accomplish a process that might appear quite complex without the aid of such a data structure.

It's time to put these concepts into practice. This is how we can define a finite automaton in C#.

public class FSM
{
    public List<FSMState> States { get; set; }

    public List<FSMTransition> Transitions { get; set; }

    public FSM()
    {
        States = new List<FSMState>();
        Transitions = new List<FSMTransition>();
    }

    public bool Simulate(string pattern)
    {
        var currentState = States.FirstOrDefault(x => x.IsStart);
        foreach (var c in pattern.ToLower())
        {
            // Follow the first transition leaving the current state that matches the character
            FSMState next = null;
            foreach (var transition in Transitions.Where(t => t.Start.Id == currentState.Id))
            {
                next = transition.Transition(c);
                if (next != null) break;
            }

            // No matching transition: the string is not recognized
            if (next == null) return false;
            currentState = next;
        }

        return currentState != null && currentState.IsEnd;
    }
}

public class FSMState
{
    public string Id { get; set; }

    public bool IsStart { get; set; }

    public bool IsEnd { get; set; }
}

public class FSMTransition
{
    public FSMState Start { get; set; }

    public FSMState End { get; set; }

    public char Item { get; set; }

    public FSMState Transition(char c)
    {
        if (Item == c) return End;
        return null;
    }
}
Information

Note that the FSM class contains a Simulate method; its goal is to take a string as an argument and return true or false based on whether the automaton recognizes this string or not.

An example of a finite automaton for recognizing variables is provided below. Note that this implementation is slightly more permissive than the automaton described above: after the first letter, it also accepts digits, so that identifiers such as $foo1234$ are recognized as well.

var fsm = new FSM();

var state0 = new FSMState() { Id = "0", IsStart = true, IsEnd = false };
var state1 = new FSMState() { Id = "1", IsStart = false, IsEnd = true };

var transitions0 = new List<FSMTransition>();
foreach (var c in "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
{
    var transition = new FSMTransition() { Start = state0, End = state1, Item = c };
    transitions0.Add(transition);
}

var transitions1 = new List<FSMTransition>();
foreach (var c in "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")
{
    var transition = new FSMTransition() { Start = state1, End = state1, Item = c };
    transitions1.Add(transition);
}

fsm.States = new List<FSMState> { state0, state1 };
fsm.Transitions = transitions0.Concat(transitions1).ToList();
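A quick check confirms that this automaton behaves as expected (the results follow directly from the rules stated above).

Console.WriteLine(fsm.Simulate("foo"));     // True: letters only, ending in the accepting state
Console.WriteLine(fsm.Simulate("foo1234")); // True: digits are allowed after the first letter
Console.WriteLine(fsm.Simulate("2x"));      // False: no transition labeled '2' leaves the initial state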

Completing the tokenizer

With this setup in place, we are now equipped to detect functions and variables.

public Token GetNextToken()
{
    // Skip whitespace
    while (char.IsWhiteSpace(_currentChar))
        GetNextChar();

    if (_currentChar == '\0') return new Token("", TokenType.EOF);

    // Detect operators and parentheses
    // omitted for brevity

    // Detect numbers
    // omitted for brevity

    // Detect functions and variables
    if (char.IsLetter(_currentChar))
    {
        var sb = new StringBuilder();
        while (char.IsLetter(_currentChar) || char.IsDigit(_currentChar))
        {
            sb.Append(_currentChar);
            GetNextChar();
        }

        var lexeme = sb.ToString();
        foreach (var category in _fsms.Keys)
        {
            var fsm = _fsms[category];
            if (fsm.Simulate(lexeme))
                return new Token(lexeme, (TokenType)Enum.Parse(typeof(TokenType), category));
        }

        throw new InvalidDataException($"Unrecognized lexeme: {lexeme}");
    }

    throw new InvalidDataException($"Unexpected character: {_currentChar}");
}

Each time we encounter a letter, we populate a buffer until we have captured all the corresponding characters. Then, we need to determine whether the word formed is a predefined function or a variable. To do this, we simulate the finite automata we previously defined; they are stored in a _fsms dictionary that maps each token category name to its automaton.
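The snippets above do not show how the _fsms dictionary is populated. Below is one possible setup, written as a sketch: the helper names BuildKeywordFsm and BuildVariableFsm are hypothetical, and the FUNCTION automaton is registered (and therefore tested) before the VARIABLE one so that keywords such as cos are not mistaken for variables.

// Inside the Tokenizer class: one automaton per token category (hypothetical setup).
private readonly Dictionary<string, FSM> _fsms = new Dictionary<string, FSM>
{
    ["FUNCTION"] = BuildKeywordFsm("cos", "sin", "exp"),
    ["VARIABLE"] = BuildVariableFsm()
};

// Builds an automaton recognizing exactly the given keywords:
// each keyword gets its own chain of states starting from a shared initial state.
private static FSM BuildKeywordFsm(params string[] keywords)
{
    var fsm = new FSM();
    var start = new FSMState { Id = "start", IsStart = true };
    fsm.States.Add(start);

    foreach (var keyword in keywords)
    {
        var current = start;
        for (var i = 0; i < keyword.Length; i++)
        {
            var next = new FSMState { Id = $"{keyword}-{i}", IsEnd = i == keyword.Length - 1 };
            fsm.States.Add(next);
            fsm.Transitions.Add(new FSMTransition { Start = current, End = next, Item = keyword[i] });
            current = next;
        }
    }

    return fsm;
}

// Builds the two-state variable automaton from the previous section
// (a letter first, then any sequence of letters or digits).
private static FSM BuildVariableFsm()
{
    var fsm = new FSM();
    var start = new FSMState { Id = "0", IsStart = true };
    var accept = new FSMState { Id = "1", IsEnd = true };
    fsm.States.Add(start);
    fsm.States.Add(accept);

    foreach (var c in "abcdefghijklmnopqrstuvwxyz")
        fsm.Transitions.Add(new FSMTransition { Start = start, End = accept, Item = c });
    foreach (var c in "abcdefghijklmnopqrstuvwxyz0123456789")
        fsm.Transitions.Add(new FSMTransition { Start = accept, End = accept, Item = c });

    return fsm;
}

This sketch relies on the dictionary preserving insertion order; if that ordering matters in your runtime, an explicit list of (category, automaton) pairs would make it deterministic.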

Putting it all together

We can now go ahead and define a Function class, using our tokenizer to carry out the first step of lexical analysis.

public class Function
{
    private ITokenizer _tokenizer;
    private Expression _inner;

    public string Definition => _inner.ToString();

    private Function(Expression inner)
    {
        _inner = inner;
    }

    public Function(string definition)
    {
        _tokenizer = new Tokenizer(new StringReader(definition));

        // code to write (see next post)
    }
}

This is just the beginning, as the class doesn’t do anything particularly meaningful at this stage: it simply takes a string as input (the mathematical expression we want to parse) and initializes a tokenizer. It will be the job of the syntactic analyzer to use this tokenizer to build an expression tree, which will be the focus of the next post.
