Writing a simple lexical analyzer for SQL - Part 4
In this concluding post, we will put into practice what was discussed earlier by implementing it in C#.
Defining tokens, categories and more
Having defined earlier what a category is, we now need to encode it explicitly in C#. We will simply use an enum for that.
public enum Category { OPERATOR, KEYWORD, IDENTIFIER, PUNCTUATION, NUMBER, LITERAL, ALL }
We follow a similar process for tokens.
public class Token
{
    public string Lexeme { get; set; }

    public Category Category { get; set; }

    public Token(string lexeme, Category category)
    {
        Lexeme = lexeme;
        Category = category;
    }

    public override string ToString()
    {
        return $"({Category}|{Lexeme})";
    }
}
That is quite straightforward, and with these definitions, we can now proceed to implement finite automata.
public class FSM
{
    public List<char> Sigma { get; set; }

    public List<FSMState> States { get; set; }

    public List<FSMTransition> Transitions { get; set; }

    public FSM()
    {
        States = new List<FSMState>();
        Transitions = new List<FSMTransition>();
    }

    public bool Simulate(string pattern)
    {
        var currentState = States.FirstOrDefault(x => x.IsStart);
        foreach (var c in pattern.ToLower())
        {
            var transitions = Transitions.Where(t => t.Start.Id == currentState.Id).ToList();

            // Try each outgoing transition until one accepts the current character
            FSMState nextState = null;
            foreach (var transition in transitions)
            {
                nextState = transition.Transition(c);
                if (nextState != null) break;
            }

            if (nextState == null) return false;
            currentState = nextState;
        }

        return currentState.IsEnd;
    }
}

public class FSMState
{
    public string Id { get; set; }

    public bool IsStart { get; set; }

    public bool IsEnd { get; set; }
}

public class FSMTransition
{
    public FSMState Start { get; set; }

    public FSMState End { get; set; }

    public char Item { get; set; }

    public FSMState Transition(char c)
    {
        if (Item == c) return End;
        return null;
    }
}
Note that the FSM class contains a Simulate method; its goal is to take a string as an argument and return true or false based on whether the automaton recognizes this string or not.
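As a quick sanity check, here is a minimal sketch (using the classes above) of a hand-built automaton with three states that recognizes exactly the string "ab":

```csharp
// Build a tiny automaton by hand: 0 --a--> 1 --b--> 2 (accepting)
var s0 = new FSMState { Id = "0", IsStart = true };
var s1 = new FSMState { Id = "1" };
var s2 = new FSMState { Id = "2", IsEnd = true };

var fsm = new FSM();
fsm.States.AddRange(new[] { s0, s1, s2 });
fsm.Transitions.Add(new FSMTransition { Start = s0, End = s1, Item = 'a' });
fsm.Transitions.Add(new FSMTransition { Start = s1, End = s2, Item = 'b' });

Console.WriteLine(fsm.Simulate("ab")); // True
Console.WriteLine(fsm.Simulate("a"));  // False: state 1 is not accepting
```

Note that Simulate lowercases its input, so "AB" would be accepted as well.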
Defining the lexical analyzer
First and foremost, we need to define a general interface for a lexical analyzer (also referred to as a tokenizer).
public interface ITokenizer
{
    bool TryTokenize(out List<Token> tokens, out List<string> errors);
}
The idea is quite intuitive: the tokenizer receives a raw string, attempts to break it down into tokens, and reports any errors it encounters along the way.
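To make the contract concrete, here is how a client might drive the SqlTokenizer that we implement below (a usage sketch, not part of the analyzer itself):

```csharp
// Tokenize a simple query and print either the tokens or the errors
ITokenizer tokenizer = new SqlTokenizer("SELECT * FROM Customers");

if (tokenizer.TryTokenize(out var tokens, out var errors))
{
    foreach (var token in tokens)
        Console.WriteLine(token);
}
else
{
    foreach (var error in errors)
        Console.WriteLine(error);
}
```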
In our scenario, where we are implementing a lexical analyzer for SQL, we need to create a concrete class for that.
public class SqlTokenizer : ITokenizer
{
    private TextReader _reader;
    private char _currentChar;

    private Dictionary<string, FSM> _fsms;
    private Dictionary<string, IFSMBuilder> _builders;

    public SqlTokenizer(string sql)
    {
        _reader = new StringReader(sql);
        _fsms = new Dictionary<string, FSM>();

        _builders = new Dictionary<string, IFSMBuilder>
        {
            { Category.OPERATOR.ToString(), new OperatorFSMBuilder() },
            { Category.KEYWORD.ToString(), new KeywordFSMBuilder() },
            { Category.IDENTIFIER.ToString(), new IdentifierFSMBuilder() }
        };

        foreach (var category in _builders.Keys)
        {
            var fsmBuilder = _builders[category];
            _fsms.Add(category, fsmBuilder.Build());
        }

        GetNextChar();
    }

    public bool TryTokenize(out List<Token> tokens, out List<string> errors)
    {
        tokens = new List<Token>();
        errors = new List<string>();

        while (_currentChar != '\0')
        {
            try
            {
                var token = GetNextToken();
                if (token != null) tokens.Add(token);
            }
            catch (InvalidDataException e)
            {
                // Record the lexical error and keep scanning
                errors.Add(e.Message);
            }
        }

        return errors.Count == 0;
    }

    private Token GetNextToken()
    {
        // Skip whitespace
        while (char.IsWhiteSpace(_currentChar))
            GetNextChar();

        // End of input reached (possibly after trailing whitespace)
        if (_currentChar == '\0') return null;

        switch (_currentChar)
        {
            case '*':
                GetNextChar();
                return new Token("*", Category.ALL);
            case '=':
                GetNextChar();
                return new Token("=", Category.OPERATOR);
            case '(':
                GetNextChar();
                return new Token("(", Category.PUNCTUATION);
            case ')':
                GetNextChar();
                return new Token(")", Category.PUNCTUATION);
            case '.':
                GetNextChar();
                return new Token(".", Category.PUNCTUATION);
            case ',':
                GetNextChar();
                return new Token(",", Category.PUNCTUATION);
            default:
                break;
        }

        if (char.IsDigit(_currentChar))
        {
            var sb = new StringBuilder();
            var haveDecimalPoint = false;
            while (char.IsDigit(_currentChar) || (!haveDecimalPoint && _currentChar == '.'))
            {
                sb.Append(_currentChar);
                if (_currentChar == '.') haveDecimalPoint = true;
                GetNextChar();
            }

            var number = double.Parse(sb.ToString(), CultureInfo.InvariantCulture);
            return new Token(number.ToString(CultureInfo.InvariantCulture), Category.NUMBER);
        }

        if (char.IsLetter(_currentChar))
        {
            var sb = new StringBuilder();
            while (char.IsLetter(_currentChar) || char.IsDigit(_currentChar))
            {
                sb.Append(_currentChar);
                GetNextChar();
            }

            var lexeme = sb.ToString();
            foreach (var category in _fsms.Keys)
            {
                var fsm = _fsms[category];
                if (fsm.Simulate(lexeme))
                    return new Token(lexeme, (Category)Enum.Parse(typeof(Category), category));
            }

            // No automaton recognizes this lexeme: signal a lexical error
            throw new InvalidDataException($"The lexeme {lexeme} is not recognized by the language.");
        }

        if (_currentChar == '\'')
        {
            GetNextChar();

            var sb = new StringBuilder();
            while (_currentChar != '\'' && _currentChar != '\0')
            {
                sb.Append(_currentChar);
                GetNextChar();
            }

            // Skip the closing quote
            GetNextChar();

            return new Token(sb.ToString(), Category.LITERAL);
        }

        if (_currentChar == '<' || _currentChar == '>')
        {
            var c = _currentChar;
            GetNextChar();

            if (_currentChar == '=')
            {
                GetNextChar();
                return new Token($"{c}=", Category.OPERATOR);
            }

            // A lone '<' or '>' has already been consumed
            return new Token(c.ToString(), Category.OPERATOR);
        }

        // Consume the offending character so that scanning can resume
        var unexpected = _currentChar;
        GetNextChar();
        throw new InvalidDataException($"Unexpected character: {unexpected}");
    }

    private void GetNextChar()
    {
        var ch = _reader.Read();
        _currentChar = ch < 0 ? '\0' : (char)ch;
    }
}
This code is particularly interesting because it shows how basic processing can be seamlessly combined with more sophisticated machinery. When encountering a right parenthesis ')' or a simple dot '.', we immediately return the corresponding token. However, when more nuanced distinctions are required, such as differentiating between an identifier, the 'LIKE' operator, or a keyword, we employ more advanced techniques, namely finite automata.
Additionally, we have defined helper methods, GetNextChar() and GetNextToken(). These functions are fundamental and commonly used when implementing a lexical analyzer.
Note that this class builds an automaton for a few categories and implements the TryTokenize method by simulating each automaton one by one until one recognizes the lexeme at hand. If none of them succeeds, we record an error for this lexeme. For example, this is what the finite automaton for identifiers looks like.
public interface IFSMBuilder
{
    FSM Build();
}

public class IdentifierFSMBuilder : IFSMBuilder
{
    public FSM Build()
    {
        var sigma = new List<char>();
        for (var c = 'A'; c <= 'Z'; c++)
            sigma.Add(c);
        for (var c = 'a'; c <= 'z'; c++)
            sigma.Add(c);

        var fsm = new FSM();
        fsm.Sigma = sigma;

        var state0 = new FSMState() { Id = "0", IsStart = true, IsEnd = false };
        fsm.States.Add(state0);
        var stateEnd = new FSMState() { Id = "1", IsStart = false, IsEnd = true };
        fsm.States.Add(stateEnd);

        foreach (var letter in sigma)
        {
            var transition0 = new FSMTransition() { Start = state0, End = stateEnd, Item = letter };
            fsm.Transitions.Add(transition0);
            var transitionEnd = new FSMTransition() { Start = stateEnd, End = stateEnd, Item = letter };
            fsm.Transitions.Add(transitionEnd);
        }

        return fsm;
    }
}
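The other builders are not shown in this series, but a KeywordFSMBuilder could, for instance, chain one state per character of each keyword so that only exact keyword spellings reach an accepting state. The sketch below is an illustration under that assumption, with a deliberately short keyword list; the actual implementation may differ.

```csharp
public class KeywordFSMBuilder : IFSMBuilder
{
    // Hypothetical keyword list; a real builder would cover the full SQL subset
    private static readonly string[] Keywords = { "select", "from", "where", "and" };

    public FSM Build()
    {
        var fsm = new FSM();
        var start = new FSMState { Id = "start", IsStart = true };
        fsm.States.Add(start);

        // One linear chain of states per keyword; the last state is accepting.
        // Caveat: keywords sharing a prefix would need merged states (a trie),
        // since FSM.Simulate follows only the first matching transition.
        foreach (var keyword in Keywords)
        {
            var current = start;
            for (var i = 0; i < keyword.Length; i++)
            {
                var next = new FSMState
                {
                    Id = $"{keyword}:{i}",
                    IsEnd = i == keyword.Length - 1
                };
                fsm.States.Add(next);
                fsm.Transitions.Add(new FSMTransition { Start = current, End = next, Item = keyword[i] });
                current = next;
            }
        }

        return fsm;
    }
}
```

With this sketch, Simulate("select") returns true, while Simulate("sel") returns false because intermediate states are not accepting; since Simulate lowercases its input, "SELECT" is accepted as well.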
Running the program
- SELECT * FROM Customers c WHERE c.Id = 10100
(KEYWORD|SELECT)
(ALL|*)
(KEYWORD|FROM)
(IDENTIFIER|Customers)
(IDENTIFIER|c)
(KEYWORD|WHERE)
(IDENTIFIER|c)
(PUNCTUATION|.)
(IDENTIFIER|Id)
(OPERATOR|=)
(NUMBER|10100)
- SELECT c.Name, c.Age FROM Customers c WHERE c.Id = 10100 AND c.Age <= 35
(KEYWORD|SELECT)
(IDENTIFIER|c)
(PUNCTUATION|.)
(IDENTIFIER|Name)
(PUNCTUATION|,)
(IDENTIFIER|c)
(PUNCTUATION|.)
(IDENTIFIER|Age)
(KEYWORD|FROM)
(IDENTIFIER|Customers)
(IDENTIFIER|c)
(KEYWORD|WHERE)
(IDENTIFIER|c)
(PUNCTUATION|.)
(IDENTIFIER|Id)
(OPERATOR|=)
(NUMBER|10100)
(KEYWORD|AND)
(IDENTIFIER|c)
(PUNCTUATION|.)
(IDENTIFIER|Age)
(OPERATOR|<=)
(NUMBER|35)
- Customers SELECT WHERE Id FROM
(IDENTIFIER|Customers)
(KEYWORD|SELECT)
(KEYWORD|WHERE)
(IDENTIFIER|Id)
(KEYWORD|FROM)
In SQL, this example is entirely nonsensical, yet it adheres to lexical correctness. It will be the responsibility of the parser to identify the error.
- SELECT * FROM Customers04
The lexeme Customers04 is not recognized by the language.
This example violates the lexical rules defined for identifiers as it includes numeric characters.
Final thoughts
The lexical analyzer represents only the initial phase in the compilation process. It is succeeded by the syntax analyzer (or parser), which ensures that the output produced adheres to the grammar rules. Grammars and context-free languages constitute another intriguing field, but we will reserve that for a future article.
If you wish to delve deeper into this topic, consider the following books, which cover all the concepts highlighted in this series and explore more advanced ones.
Introduction to Automata Theory, Languages, and Computation (Hopcroft, Motwani, Ullman)
Database Systems: The Complete Book (Garcia-Molina, Ullman, Widom)
Do not hesitate to contact me should you require further information.