Writing a simple lexical analyzer for SQL - Part 4

In this concluding post, we put into practice what was discussed in the previous parts by implementing a lexical analyzer in C#.

Defining tokens, categories and more

Since we defined earlier what a category is, we now have to code it explicitly in C#. We will simply use an enum for that.

public enum Category { OPERATOR, KEYWORD, IDENTIFIER, PUNCTUATION, NUMBER, LITERAL, ALL }

We follow a similar process for tokens.

public class Token
{
    public string Lexeme { get; set; }

    public Category Category { get; set; }

    public Token(string lexeme, Category category)
    {
        Lexeme = lexeme;
        Category = category;
    }

    public override string ToString()
    {
        return $"({Category}|{Lexeme})";
    }
}

That is quite straightforward, and with these definitions, we can now proceed to implement finite automata.

public class FSM
{
    public List<char> Sigma { get; set; }

    public List<FSMState> States { get; set; }

    public List<FSMTransition> Transitions { get; set; }

    public FSM()
    {
        States = new List<FSMState>();
        Transitions = new List<FSMTransition>();
    }

    public bool Simulate(string pattern)
    {
        var currentState = States.FirstOrDefault(x => x.IsStart);
        foreach (var c in pattern.ToLower())
        {
            // Follow the first transition leaving the current state that accepts c
            var nextState = Transitions
                .Where(t => t.Start.Id == currentState.Id)
                .Select(t => t.Transition(c))
                .FirstOrDefault(s => s != null);

            // No transition accepts c: the automaton rejects the string
            if (nextState == null) return false;

            currentState = nextState;
        }

        return currentState.IsEnd;
    }
}

public class FSMState
{
    public string Id { get; set; }

    public bool IsStart { get; set; }

    public bool IsEnd { get; set; }
}

public class FSMTransition
{
    public FSMState Start { get; set; }

    public FSMState End { get; set; }

    public char Item { get; set; }

    public FSMState Transition(char c)
    {
        if (Item == c) return End;
        return null;
    }
}
Information

Note that the FSM class contains a Simulate method; its goal is to take a string as an argument and return true or false based on whether the automaton recognizes this string or not.
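
To make this concrete, here is a small usage example that does not appear in the original post: it builds, by hand, an automaton recognizing exactly the string 'or' and simulates it on a few inputs.

// A hand-built automaton recognizing exactly the string "or":
// state 0 --'o'--> state 1 --'r'--> state 2 (accepting)
var s0 = new FSMState { Id = "0", IsStart = true, IsEnd = false };
var s1 = new FSMState { Id = "1", IsStart = false, IsEnd = false };
var s2 = new FSMState { Id = "2", IsStart = false, IsEnd = true };

var fsm = new FSM { Sigma = new List<char> { 'o', 'r' } };
fsm.States.AddRange(new[] { s0, s1, s2 });
fsm.Transitions.Add(new FSMTransition { Start = s0, End = s1, Item = 'o' });
fsm.Transitions.Add(new FSMTransition { Start = s1, End = s2, Item = 'r' });

Console.WriteLine(fsm.Simulate("or"));   // True
Console.WriteLine(fsm.Simulate("OR"));   // True: Simulate lowercases its input
Console.WriteLine(fsm.Simulate("o"));    // False: state 1 is not accepting
Console.WriteLine(fsm.Simulate("ord"));  // False: no transition for 'd' from state 2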

Defining the lexical analyzer

First and foremost, we need to define a general interface for a lexical analyzer (also referred to as a tokenizer).

public interface ITokenizer
{
    bool TryTokenize(out List<Token> tokens, out List<string> errors);
}

This method is quite intuitive: given a raw string (supplied to the concrete implementation, here through its constructor), the lexical analyzer attempts to break it down into tokens, returning any errors it encounters along the way.

In our scenario, where we are implementing a lexical analyzer for SQL, we need to create a concrete class for that.

using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Text;

public class SqlTokenizer : ITokenizer
{
    private TextReader _reader;
    private char _currentChar;

    private Dictionary<string, FSM> _fsms;
    private Dictionary<string, IFSMBuilder> _builders;

    public SqlTokenizer(string sql)
    {
        _reader = new StringReader(sql);
        _fsms = new Dictionary<string, FSM>();

        _builders = new Dictionary<string, IFSMBuilder>
        {
            { Category.OPERATOR.ToString(), new OperatorFSMBuilder() },
            { Category.KEYWORD.ToString(), new KeywordFSMBuilder() },
            { Category.IDENTIFIER.ToString(), new IdentifierFSMBuilder() }
        };

        // Build one automaton per category up front
        foreach (var category in _builders.Keys)
        {
            var fsmBuilder = _builders[category];
            _fsms.Add(category, fsmBuilder.Build());
        }

        GetNextChar();
    }

    public bool TryTokenize(out List<Token> tokens, out List<string> errors)
    {
        tokens = new List<Token>();
        errors = new List<string>();

        while (_currentChar != '\0')
        {
            try
            {
                var token = GetNextToken();
                if (token != null) tokens.Add(token);
            }
            catch (InvalidDataException ex)
            {
                errors.Add(ex.Message);
                return false;
            }
        }

        return errors.Count == 0;
    }

    private Token GetNextToken()
    {
        // Skip whitespace
        while (char.IsWhiteSpace(_currentChar))
            GetNextChar();

        // The input may end with trailing whitespace
        if (_currentChar == '\0') return null;

        // Single-character tokens are returned immediately
        switch (_currentChar)
        {
            case '*':
                GetNextChar();
                return new Token("*", Category.ALL);
            case '=':
                GetNextChar();
                return new Token("=", Category.OPERATOR);
            case '(':
                GetNextChar();
                return new Token("(", Category.PUNCTUATION);
            case ')':
                GetNextChar();
                return new Token(")", Category.PUNCTUATION);
            case '.':
                GetNextChar();
                return new Token(".", Category.PUNCTUATION);
            case ',':
                GetNextChar();
                return new Token(",", Category.PUNCTUATION);
            default:
                break;
        }

        // Numbers, with at most one decimal point
        if (char.IsDigit(_currentChar))
        {
            var sb = new StringBuilder();
            var haveDecimalPoint = false;
            while (char.IsDigit(_currentChar) || (!haveDecimalPoint && _currentChar == '.'))
            {
                haveDecimalPoint |= _currentChar == '.';
                sb.Append(_currentChar);
                GetNextChar();
            }

            var number = double.Parse(sb.ToString(), CultureInfo.InvariantCulture);
            return new Token(number.ToString(CultureInfo.InvariantCulture), Category.NUMBER);
        }

        // Words: simulate each automaton until one recognizes the lexeme
        if (char.IsLetter(_currentChar))
        {
            var sb = new StringBuilder();
            while (char.IsLetter(_currentChar) || char.IsDigit(_currentChar))
            {
                sb.Append(_currentChar);
                GetNextChar();
            }

            var lexeme = sb.ToString();
            foreach (var category in _fsms.Keys)
            {
                var fsm = _fsms[category];
                if (fsm.Simulate(lexeme))
                    return new Token(lexeme, (Category)Enum.Parse(typeof(Category), category));
            }

            // No automaton recognizes the lexeme
            throw new InvalidDataException($"The lexeme {lexeme} is not recognized by the language.");
        }

        // String literals enclosed in single quotes
        if (_currentChar == '\'')
        {
            GetNextChar();

            var sb = new StringBuilder();
            while (_currentChar != '\'' && _currentChar != '\0')
            {
                sb.Append(_currentChar);
                GetNextChar();
            }

            // Skip the closing quote
            GetNextChar();

            return new Token(sb.ToString(), Category.LITERAL);
        }

        // Two-character operators: <=, >=
        if (_currentChar == '<' || _currentChar == '>')
        {
            var c = _currentChar;
            GetNextChar();

            if (_currentChar == '=')
            {
                GetNextChar();
                return new Token($"{c}=", Category.OPERATOR);
            }

            return new Token(c.ToString(), Category.OPERATOR);
        }

        throw new InvalidDataException($"Unexpected character: {_currentChar}");
    }

    private void GetNextChar()
    {
        var ch = _reader.Read();
        _currentChar = ch < 0 ? '\0' : (char)ch;
    }
}

This code is particularly interesting as it shows how basic processing can be seamlessly combined with more elaborate machinery. When encountering a right parenthesis (')') or a simple dot ('.'), we immediately return the corresponding token. However, when more nuanced distinctions are required, such as differentiating between an identifier, the 'LIKE' operator, or a keyword, we employ more advanced techniques: the finite automata.

Information

Additionally, we have defined helper methods, GetNextChar() and GetNextToken(). These functions are fundamental and commonly used when implementing a lexical analyzer.

Note that this class builds an automaton for each of a few categories and implements the TryTokenize method by simulating these automata one by one until one of them recognizes the lexeme at hand. If none succeeds, an error is reported for this lexeme.
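
The constructor above also relies on an IFSMBuilder abstraction whose definition is not shown in this series; given how it is used, a minimal assumed version suffices.

// Minimal assumed definition: each builder constructs one automaton
public interface IFSMBuilder
{
    FSM Build();
}

For example, this is what the builder for the identifier automaton looks like.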

public class IdentifierFSMBuilder : IFSMBuilder
{
    public FSM Build()
    {
        // The alphabet: uppercase and lowercase letters
        var sigma = new List<char>();
        for (var c = 'A'; c <= 'Z'; c++)
            sigma.Add(c);
        for (var c = 'a'; c <= 'z'; c++)
            sigma.Add(c);

        var fsm = new FSM();
        fsm.Sigma = sigma;

        var state0 = new FSMState() { Id = "0", IsStart = true, IsEnd = false };
        fsm.States.Add(state0);
        var stateEnd = new FSMState() { Id = "1", IsStart = false, IsEnd = true };
        fsm.States.Add(stateEnd);

        // Any letter moves from the start state to the accepting state,
        // which then loops on itself for every subsequent letter
        foreach (var letter in sigma)
        {
            var transition0 = new FSMTransition() { Start = state0, End = stateEnd, Item = letter };
            fsm.Transitions.Add(transition0);
            var transitionEnd = new FSMTransition() { Start = stateEnd, End = stateEnd, Item = letter };
            fsm.Transitions.Add(transitionEnd);
        }

        return fsm;
    }
}
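
The KeywordFSMBuilder and OperatorFSMBuilder referenced in the tokenizer's constructor are built along the same lines but are not shown here. For completeness, here is a minimal sketch of what the keyword builder could look like; it is an assumption rather than the original implementation, and the keyword set is taken from the examples below.

// A sketch, not the original implementation: each keyword becomes a linear
// chain of states, one transition per character, ending in an accepting state.
public class KeywordFSMBuilder : IFSMBuilder
{
    // Assumed keyword set, taken from the examples in this post
    private static readonly string[] Keywords = { "select", "from", "where", "and" };

    public FSM Build()
    {
        var fsm = new FSM { Sigma = new List<char>() };

        var start = new FSMState { Id = "0", IsStart = true, IsEnd = false };
        fsm.States.Add(start);

        var nextId = 1;
        foreach (var keyword in Keywords)
        {
            var current = start;
            for (var i = 0; i < keyword.Length; i++)
            {
                // The state reached by the keyword's last character is accepting
                var next = new FSMState
                {
                    Id = (nextId++).ToString(),
                    IsStart = false,
                    IsEnd = i == keyword.Length - 1
                };
                fsm.States.Add(next);
                fsm.Transitions.Add(new FSMTransition { Start = current, End = next, Item = keyword[i] });

                if (!fsm.Sigma.Contains(keyword[i]))
                    fsm.Sigma.Add(keyword[i]);

                current = next;
            }
        }

        return fsm;
    }
}

Since Simulate follows the first transition that matches, keywords sharing a common prefix would require their chains to be merged into a trie; the four keywords assumed here all begin with distinct letters, which keeps the sketch correct.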

Running the program
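
The driver program is not shown in the post; a minimal one along these lines (an assumed sketch, not the original) reproduces the outputs below.

// Tokenize a query and print either the tokens or the errors
var tokenizer = new SqlTokenizer("SELECT * FROM Customers c WHERE c.Id = 10100");
if (tokenizer.TryTokenize(out var tokens, out var errors))
{
    foreach (var token in tokens)
        Console.WriteLine(token);
}
else
{
    foreach (var error in errors)
        Console.WriteLine(error);
}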

  • SELECT * FROM Customers c WHERE c.Id = 10100
(KEYWORD|SELECT)
(ALL|*)
(KEYWORD|FROM)
(IDENTIFIER|Customers)
(IDENTIFIER|c)
(KEYWORD|WHERE)
(IDENTIFIER|c)
(PUNCTUATION|.)
(IDENTIFIER|Id)
(OPERATOR|=)
(NUMBER|10100)
  • SELECT c.Name, c.Age FROM Customers c WHERE c.Id = 10100 AND c.Age <= 35
(KEYWORD|SELECT)
(IDENTIFIER|c)
(PUNCTUATION|.)
(IDENTIFIER|Name)
(PUNCTUATION|,)
(IDENTIFIER|c)
(PUNCTUATION|.)
(IDENTIFIER|Age)
(KEYWORD|FROM)
(IDENTIFIER|Customers)
(IDENTIFIER|c)
(KEYWORD|WHERE)
(IDENTIFIER|c)
(PUNCTUATION|.)
(IDENTIFIER|Id)
(OPERATOR|=)
(NUMBER|10100)
(KEYWORD|AND)
(IDENTIFIER|c)
(PUNCTUATION|.)
(IDENTIFIER|Age)
(OPERATOR|<=)
(NUMBER|35)
  • Customers SELECT WHERE Id FROM
(IDENTIFIER|Customers)
(KEYWORD|SELECT)
(KEYWORD|WHERE)
(IDENTIFIER|Id)
(KEYWORD|FROM)
Information

In SQL, this example is entirely nonsensical, yet it is lexically correct. It will be the parser's responsibility to identify the error.

  • SELECT * FROM Customers04
The lexeme Customers04 is not recognized by the language.
Information

This example violates the lexical rules defined for identifiers as it includes numeric characters.
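
As an aside, if the language were relaxed to allow digits after a leading letter, as most SQL dialects do, only the accepting state would need additional transitions. Here is a hypothetical variant of the builder, not part of the original design:

// Hypothetical variant, not part of the original design: identifiers may
// contain digits after the first letter, as in most SQL dialects.
public class RelaxedIdentifierFSMBuilder : IFSMBuilder
{
    public FSM Build()
    {
        var letters = new List<char>();
        for (var c = 'A'; c <= 'Z'; c++) letters.Add(c);
        for (var c = 'a'; c <= 'z'; c++) letters.Add(c);

        var digits = new List<char>();
        for (var c = '0'; c <= '9'; c++) digits.Add(c);

        var fsm = new FSM();
        fsm.Sigma = letters.Concat(digits).ToList();

        var state0 = new FSMState() { Id = "0", IsStart = true, IsEnd = false };
        fsm.States.Add(state0);
        var stateEnd = new FSMState() { Id = "1", IsStart = false, IsEnd = true };
        fsm.States.Add(stateEnd);

        // The first character must be a letter
        foreach (var letter in letters)
            fsm.Transitions.Add(new FSMTransition { Start = state0, End = stateEnd, Item = letter });

        // Subsequent characters may be letters or digits
        foreach (var item in fsm.Sigma)
            fsm.Transitions.Add(new FSMTransition { Start = stateEnd, End = stateEnd, Item = item });

        return fsm;
    }
}

Registered in place of IdentifierFSMBuilder, this variant would tokenize Customers04 as an IDENTIFIER instead of rejecting it.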

Final thoughts

The lexical analyzer represents only the initial phase of the compilation process. It is followed by the syntax analyzer (or parser), which ensures that the token stream produced adheres to the grammar rules. Grammars and context-free languages constitute another intriguing field, but we will reserve that for a future article.

If you wish to dig deeper into this topic, the following books cover all the concepts emphasized in this series and explore more advanced ones.

Introduction to Automata Theory, Languages, and Computation (Hopcroft, Motwani, Ullman)
Database Systems: The Complete Book (Garcia-Molina, Ullman, Widom)

Do not hesitate to contact me should you require further information.